USING  DECISION  TREES  AND  FEATURE  CONSTRUCTION 

TO  DESCRIBE  CHANGING  CONSUMER  LIFE-STYLES 

AND  EXPECTATIONS 


By 

RAYMOND  L.  MAJOR 


A  DISSERTATION  PRESENTED  TO  THE  GRADUATE  SCHOOL 

OF  THE  UNIVERSITY  OF  FLORIDA  IN  PARTIAL  FULFILLMENT 

OF  THE  REQUIREMENTS  FOR  THE  DEGREE  OF 

DOCTOR  OF  PHILOSOPHY 

UNIVERSITY  OF  FLORIDA 

1994 


Copyright  1994 

by 

Raymond  L.  Major 


To my mother, Pearl Simmons Major, for all her encouragement and support, and,
to  my  deceased  father,  John  Willie  Major,  for  his  wisdom  in  helping  me  face  the 
challenges  of  life. 


ACKNOWLEDGEMENTS 

I  am  deeply  indebted  to  many  people  who  helped  my  dream  become  a  reality. 
First  of  all,  my  sincerest  thanks  go  to  Dr.  Israel  Tribble  Jr.  and  all  of  his  staff  at  the 
Florida  Education  Fund.  Their  guidance,  moral,  and  especially  financial  support  truly 
helped me cope with many of the problems and frustrations I encountered during my
graduate  experience  at  the  University  of  Florida. 

I am also thankful to all of my committee members: Professors Gary Koehler and
Selwyn  Piramuthu  for  being  the  chair  and  cochair  of  my  committee;  Professors  Selcuk 
Erenguc  and  Pat  Thompson  for  their  guidance  and  support;  and  Professor  Dave  Denslow 
for  his  friendship  and  helpful  insights. 

Thanks  go  to  members  in  the  College  of  Business  who  have  helped  me  in  one  way 
or  another.  Professors  Harold  Benson  and  Richard  Elnicki  helped  me  in  adapting  to  the 
world  of  academia.  Professor  Henry  Tosi  in  the  Department  of  Management  helped  me 
make that important decision to enter a Ph.D. program. I will always be grateful for his
friendship.  Professor  Sanford  Berg  in  the  Department  of  Economics  helped  me  see  the 
beauty  of  doing  quality  research.  Thanks  also  go  to  the  Director  and  staff  of  the  Bureau 
of Economic and Business Research who provided many of the resources I required for
performing  my  research. 




Thanks  are  also  due  to  Dr.  Max  Parker  in  the  College  of  Education.  His 
accomplishments  and  enthusiasm  had  a  significant  influence  on  me.  Dr.  Roderick 
McDavis,  also  of  the  College  of  Education,  was  extremely  helpful  in  my  acclimation  to 
the culture of the University of Florida's Graduate School--Thanks, Rod!

Finally, my sincerest thanks go to my family members. I am most grateful
to  my  daughters,  Deborah  and  Brenda.  Their  love,  understanding  and  support  truly 
helped  in  making  my  graduate  experience  as  a  single  parent  a  very  positive  one.  I  am 
also  thankful  to  be  blessed  with  very  supportive  and  encouraging  siblings:  Sam,  Jimmie, 
Lamar,  Larry,  and  Sherrian.    I  thank  them  for  sharing  my  agonies  and  ecstasies. 


TABLE  OF  CONTENTS 

ACKNOWLEDGEMENTS iv 

LIST  OF  TABLES x 

LIST  OF  FIGURES xii 

Abstract    xiv 

CHAPTER 1 INTRODUCTION 1

1.1  The  1990  -  1991  Recession    1 

1.2  Survey  Measures  of  Consumer  Confidence 4 

1.2.1  National  and  Statewide  Business-Surveys 4 

1.2.2  BEBR  Survey  Data    6 

1.2.3  Consumer  Confidence  Metrics 11 

1.3  Consumer  Expectations  and  Buying  Plans    15 

1.4  AI  and  Feature  Construction 17 

1.4.1  Time-Complexity  of  Feature  Construction 20 

1.4.2  DUALTREE  Feature  Construction 22 

1.4.2.1    Dual  decision  trees    24 

1.4.3  ID3/C4.5  Decision  Trees  Using  BEBR  Sample  Data    29 

1.5  Thesis  and  Objectives    42 

1.5.1  Problem  Definition 43 

1.5.2  Problem  Resolution    43 

1.5.3  Implementation  and  Experimentation    44 

1.6  Dissertation  Outline 45 

CHAPTER  2              DESCRIBING  CONSUMER  HOUSEHOLDS    46 

2.1 Consumer Consumption of Durable Goods 48

2.1.1    Estimating  a  Demand  for  Commodities 49 

2.2  Survey  Data  of  Consumer  Households 52 

2.2.1    BEBR  Survey  of  Consumer  Confidence    54 

2.2.1.1    BEBR  index  components 55 

2.3  Describing  Purchasers  of  Durable  Goods 56 

2.3.1    Experimental  Design  using  the  BEBR  Business  Surveys  ....  61 




CHAPTER  3  MACHINE  LEARNING 64 

3.1  Background  in  Artificial  Intelligence 64 

3.2  Reasoning  Systems 64 

3.3  Machine  Learning 66 

3.3.1  Views  of  Learning 66 

3.3.2  A  Model  of  a  Learning  Machine    67 

3.3.3  Learning  Strategies 69 

3.3.4  Learning  Theories    72 

3.3.5  Learning  Algorithms 74 

3.3.5.1 Representation of learned concepts 74

3.3.5.2  Incremental  and  Non-Incremental  Learning    75 

3.3.5.3  Dealing  with  uncertainty    76 

3.3.5.4  Learning  single  or  multiple  concepts 76 

3.3.5.5  Algorithm's  search  strategy    76 

3.3.5.6  Concept  formation  goals 77 

3.3.5.7  Application  domain 79 

3.3.5.8  Criterion 80 

3.4  Concept  Description  Languages    80 

3.4.1  Binary  Trees   83 

3.4.2  Decision  Trees 85 

3.4.3  Decision  Lists    86 

3.5  C4.5  Machine  Learning  Programs 87 

CHAPTER  4  FEATURE  CONSTRUCTION     90 

4.1  Decision  Trees  and  Feature  Construction 90 

4.2  Feature-Construction  Algorithms 92 

4.2.1  Complexity  Measures 93 

4.2.2  CITRE    95 

4.2.3  FRINGE  and  Dual  FRINGE    96 

4.2.3.1    FRINGE  feature  construction    98 

4.3  Time  Complexity  Models 99 

4.3.1  Probabilistic  Models 100 

4.3.2  Algorithmic  Models    101 

4.3.2.1    Bounded  rank  decision  trees 102 

4.3.3  Research  Model 105 

4.3.3.1    Tree-construction  component    107 

4.4  Finding  New  Features 110 

4.4.1    Searching  a  Feature  Space 110 

4.5  Feature-Representation  Models 116 

4.5.1  OCCAM'S  RAZOR    117 

4.5.2  MDLP    118 

4.5.3  Boolean  Formulae    119 

4.5.3.1  Representation  classes    120 

4.5.3.2  Computation  models    120 




4.6  Feature-Construction  Models    122 

4.6.1  Exhaustive  Approach    122 

4.6.2  Binary  Tree  Construction    125 

4.7  Dual  Trees 126 

4.7.1    Properties  of  Dual  Trees 127 

4.8  DUALTREE  Feature  Construction    129 

4.8.1  Feature  Construction 130 

4.8.2  Validation  and  Verification 131 

4.8.2.1  Procedural  framework    134 

4.8.2.2  Claims  and  proofs 135 

4.8.3  Extensions  for  DUALTREE     137 

4.8.3.1  Binarizing  nominal  and  continuous  data 137 

4.8.3.2  Forming  features  with  binarized  data 139 

CHAPTER  5        DUALTREE's  DESIGN  AND  IMPLEMENTATION 141 

5.1  DUALTREE's  Representation  Model    144 

5.1.1  DUALTREE's  Adjacency-Structure    145 

5.1.2  Searching  and  Sorting  Feature  Names 146 

5.2  Graph  Processing  in  DUALTREE 148 

5.3  Feature  Construction  with  DUALTREE 149 

5.3.1  Building  Class  Successors 150 

5.3.1.1    Finding  features    152 

5.3.2  Building  the  Dual  of  the  Dual 154 

5.3.2.1    Finding  terminal  features    156 

5.3.3  Finding  New  Features 157 

5.4  DUALTREE's  Time  Complexity     158 

CHAPTER  6                                      EXPERIMENTS    159 

6.1  Experimental  Design 160 

6.1.1  Experimental  Technique    161 

6.1.2  Presentation  of  Results    163 

6.2  Feature  Construction  Using  Binary  Data    163 

6.2.1  DNF  Functions  Test  Results    166 

6.2.1.1    Useful  features  of  DNF  functions    169 

6.2.2  Multiplexor  and  Parity  Functions  Test  Results 173 

6.2.2.1 Useful features of multiplexor and parity functions 176

6.2.3  Summary  of  Results  for  Binary  Data    177 

6.3  Feature  Construction  Using  Nominal  Data 182 

6.3.1    Test  Results  using  Nominal  Data 185 

6.4  Feature  Construction  With  Continuous  Data 189 

6.4.1    Results  using  Continuous  Data 190 

6.5  DUALTREE  Descriptions  of  Consumer  Life-Styles 192 

6.5.1    Empirical  Design  and  Method    193 




6.5.2  Demographic  Descriptions  of  Financial  Conditions 195 

6.5.2.1 Descriptions of 'the same' and unsure consumers 201

6.5.3  Describing  Consumer  Buying  Plans    209 

6.6    Complexity  Results 214 

CHAPTER  7                                     CONCLUSIONS    219 

7.1  Summary 219 

7.2  Attainment  of  Goals    222 

7.2.1  Problem  Definition 222 

7.2.2  Problem  Resolution    224 

7.2.3  Implementation  and  Experiments 226 

7.3  Future  Research    228 

7.3.1  Improved  Model  Development 228 

7.3.2  Problems  in  the  Study  of  Feature  Construction    229 

APPENDIX A BEBR SURVEY QUESTIONS AND VARIABLE ASSIGNMENTS 230

APPENDIX  B              DUALTREE  SOURCE-CODE  LISTINGS    233 

REFERENCE  LIST 242 

BIOGRAPHICAL  SKETCH 250 




LIST  OF  TABLES 

Table 1.1 Distribution of respondents' answers for buying a car 38

Table 2.1 Income and age distributions of financial confidence 59

Table  2.2   Sex  and  party  distributions  of  confidence    61 

Table  6.1    Boolean  target  functions    165 

Table  6.2   Class  distributions  for  binary  data-sets    166 

Table  6.3    C4.5  results  using  DUALTREE's  features 168 

Table  6.4   Feature-formation  results  for  DNF  functions    170 

Table  6.5    C4.5  results  using  multiplexor  and  parity  functions 175 

Table  6.6   Multiplexor  and  parity  feature-formation  results 177 

Table  6.7    C4.5  results  using  nominal  data-sets 186 

Table  6.8   Feature-formation  results  using  nominal  data-sets 188 

Table  6.9   Continuous  and  nominal  data  results 192 

Table  6.10  Training  and  test  set  class-distributions 194 

Table  6.11  C4.5  results  using  BEBR  data-sets 196 

Table  6.12  Confusion  matrices  for  the  four  data  sets    198 

Table  6.13  Features  in  a  virtual  tree  for  the  SAME  class    202 

Table  6.14  Demographic  descriptions  of  consumer  buying  plans 204 

Table  6.15  Descriptions  of  'better'  and  unsure  respondents 206 


Table  6.16  Other  'better'  and  unsure  consumer-descriptions    207 

Table  6.17  Other  consumer  descriptions  regarding  buying  plans    208 

Table  6.18  Sample  distributions  using  GBTIME    212 

Table  6.19  More  C4.5  results  for  the  BEBR  data 213 

Table  6.20  Confusion  matrices  for  consumer  views  on  buying    214 

Table  6.21  A  list  of  consumer  descriptions 215 

Table  6.22  Complexity  results 218 




LIST  OF  FIGURES 

Figure  1.1    Measures  of  consumer  confidence 5 

Figure  1.2   Two  measures  of  personal  financial  expectations 6 

Figure  1.3    Consumer  expectations  of  personal  finances 8 

Figure 1.4 Percentages of financially unsure respondents 9

Figure  1.5   Expectations  of  national  conditions  and  buying  plans    10 

Figure  1.6    Indexes  determined  by  'don't  know'  combinations    12 

Figure 1.7 Consumer car purchase plans 16

Figure  1.8    A  decision  tree  and  its  dual    25 

Figure  1.9   The  re-oriented  dual  tree    26 

Figure  1.10  Forming  features  in  the  dual  tree    29 

Figure 1.11 A dual tree after feature construction 30

Figure  1.12  The  dual  of  the  dual  tree 31 

Figure  1.13  Decision  trees  with  rank  equal  to  1 32 

Figure  1.14  A  decision  tree  for  the  BEBR  data 39 

Figure  1.15  A  decision  tree  using  DUALTREE  features 41 

Figure  2.1    CCI's  for  Jan  '92  -  May  '93 56 

Figure  2.2    Consumers'  financial  confidence  during  1992 58 

Figure  3.1    Learning  machine  model    68 




Figure  3.2   Learning  algorithm  considerations 74 

Figure  4.1    A  decision  tree 99 

Figure  4.2    A  reduced  decision  tree  with  rank  =  1 102 

Figure  4.3    3"*  smallest  decision  trees  of  rank  1 103 

Figure  4.4   The  re-oriented  dual  tree    128 

Figure  6.1    Comparison  of  features  used  to  features  formed    171 

Figure  6.2   Comparison  of  edges  with  features  to  total  edges    173 

Figure  6.3    Feature  formation-rates  of  multiplexor  and  parity  functions    178 

Figure  6.4   Edge-usage  results  using  multiplexor  and  parity  functions    179 

Figure  6.5   Performance  results  for  binary  data-sets 181 

Figure  6.6   Tests  using  DUALTREE  features   183 




Abstract  of  Dissertation  Presented  to  the  Graduate  School 

of the University of Florida in Partial Fulfillment of the

Requirements  for  the  Degree  of  Doctor  of  Philosophy 

USING  DECISION  TREES  AND  FEATURE  CONSTRUCTION 

TO  DESCRIBE  CHANGING  CONSUMER  LIFE-STYLES 

AND  EXPECTATIONS 

By 

Raymond  L.  Major 

August  1994 

Chairperson:  Gary  J.  Koehler 

Major  Department:  Decision  and  Information  Sciences 

Using artificial intelligence methods to acquire expert knowledge inductively is
a key area of interest in Expert-Systems Development. This dissertation investigates the
theoretical properties of feature-construction learning algorithms and uses them to develop
an empirical model to examine several issues related to the 1990-1991 recession.
Empirical results show that feature construction can improve the performance of an
induced decision tree. We develop an analytical model of learning with feature
construction. Our model characterizes the time complexity of learning boolean functions
with polynomial size DNF expressions, when bounded-rank decision trees are used as a
concept description language. Results show that limiting the number of new features may
improve the computational efficiency of feature construction. Our procedure uses the dual
of a decision tree when forming new features. We then use our empirical model to: (1)
describe changes in consumer life-styles and expectations for time periods associated with
the 1990-1991 recession, and, (2) show that current practice for creating quantitative
measures of consumer confidence is sometimes inappropriately used. Finally, we examine
tradeoffs between expert comprehensibility and formal power, when choosing a
representation to use in expert-system applications.


CHAPTER  1 
INTRODUCTION 


Knowledge  Acquisition  is  a  process  currently  undergoing  extensive  research  by 
many  information  scientists.  In  this  dissertation  we  do  two  things.  First,  we  develop  a 
new  method  for  feature  construction.  Next,  we  use  our  new  method  to  create  a 
knowledge  base  of  information  associated  with  a  period  of  recent  economic  activity  in  the 
United  States.  Economists  and  business  analysts  may  use  this  information  to  better 
understand  certain  decision  criteria  used  by  a  diverse  group  of  consumers.  Before 
describing  a  way  to  build  the  knowledge  base,  we  first  discuss  the  kind  of  knowledge  we 
need  and  how  this  information  can  be  used. 

1.1    The  1990  -  1991  Recession 

Economists  and  business  analysts  are  currently  exploring  many  questions  related 
to  the  1990-1991  recession  and  the  recovery  period  following  it.  The  National  Bureau 
of  Economic  Research  designates  the  last  two  quarters  of  1990  and  the  first  quarter  of 
1991  as  a  period  of  negative  growth  for  the  U.S.  economy  (Blanchard  1993;  Hall  1993). 
Suggested causes for the recent recession include price shocks, higher tax rates, a decrease
in defense spending, the end of the Cold War, consumer depression, and the Iraqi invasion
of Kuwait. However, the literature suggests that a shock to consumption largely
determined the recessionary episode (Blanchard 1993; Hall 1993; Hansen and Prescott
1993; Perry and Schultze 1993). Negative consumption shocks decrease consumption of
market goods and services below trend. Perry and Schultze (1993) state:

We have been able to tag the recent recession and subsequent sluggish recovery
as clearly unusual in that--unlike its predecessors--it was not primarily driven by
a combination of policy changes and autoregressive responses by other forces
weakening total demand. We have pinpointed the weakness in consumption as the
most important locus of negative shocks, and have suggested that it arose in part
from the depressing effect on consumer confidence stemming from weak
employment growth and from the unusual prevalence of permanent--as contrasted
with temporary--layoffs. (193)

The literature leaves many questions concerning the consumption shock unanswered.
Most researchers use traditional statistical methods in their empirical models for studying
various questions regarding the recent recessionary episode. These models usually require
(1) quantitative information in the form of time series or cross-sectional data, and (2)
estimates of all unknown parameters. However, using these models, it is sometimes
difficult to examine important questions, such as how changes in attitudes of consumers
before, during, and after the recession varied by age, sex, and income. Additionally,
several researchers suggest that established models are not helpful for exploring questions
such as whether changing demographics and life-cycle factors leading to lower savings
rates are partly responsible for the slow recovery (Hall 1993; Hansen and Prescott 1993).

One reason for the frailty of statistical models is that when they are applied to 15
to 20 variables and the interrelationships between them, the maintained assumptions
required for estimation are implausible. Palies and Philip (1989) give additional
disadvantages associated with traditional statistical models including the following: (1)
the analysis can be qualitative or based on heuristic rules; (2) they are of limited use for
very short term forecasting; and (3) classifying the variables into endogenous, exogenous
and out of model variables depends on the feasibility of computing the resulting models
rather  than  on  economic  theory.  Concerning  economic  models  based  on  neoclassical 
demand  theory,  Hodgson  (1992)  suggests  that  neoclassical  economics  is  deficient  because 
of  its  narrow,  utilitarian  base,  and  because  of  its  general  treatment  of  time  and  analysis 
of  economic  processes.  Hall  (1993)  suggests  a  need  for  using  empirical  models  without 
the  neoclassical  curvature  conditions  to  examine  the  recent  recession.  Moreover,  Gianotti 
(1989)  states  that  the  trend  toward  the  formalization  of  less  structured  problems  and  the 
increased  emphasis  on  individual  attitudes  and  expectations  create  a  need  for  new 
methods  to  represent  and  manipulate  symbolic  knowledge. 

I  propose  to  develop  an  empirical  model  of  learning,  using  Artificial  Intelligence 
methods  and  techniques.  A  major  hypothesis  of  this  research  is  that  this  system  can 
produce  useful  descriptions  for  examining  questions  such  as  the  ones  previously 
mentioned,  using  data  collected  from  business  surveys  of  consumer  attitudes  and 
expectations.  Our  model  offers  the  advantage  of  being  able  to  examine  the 
interrelationships  among  a  relatively  large  set  of  variables  or  attributes.  Additionally, 
Palies  and  Philip  (1989)  state  that  a  knowledge-based  approach  gives  a  framework 
allowing  for  (1)  the  explicit  description  of  the  economic  agents  process  and  the 
economists'  behavior,  and,  (2)  dealing  with  the  quantitative  and  qualitative  scaled 
variables.  Palies  and  Philip  (1989)  overcome  certain  limitations  of  econometric  models 
by  linking  the  models  with  an  expert  system  to  explain  and  compute  several  exogenous 
variables,  taking  into  account  the  endogenous  variables  influencing  them. 


1.2   Survey  Measures  of  Consumer  Confidence 

Many  researchers  use  surveys  of  consumer  attitudes  and  expectations  as  a  source 
of  information  for  studying  the  recessionary  episode  (Blanchard  1993;  Hall  1993;  Perry 
and  Schultze  1993).  Gianotti  (1989)  describes  business  surveys  as  qualitative  and 
quantitative  data  reporting  the  opinion  of  economic  entities  (i.e.  firms,  families,  etc.) 
about  the  past  trend,  the  current  status,  and  the  expected  short-term  variations  of  several 
key  indicators.  Survey  data  of  consumer  attitudes  and  expectations  is  normally  used  to 
create  an  index  of  consumer  confidence  to  predict  changes  in  consumer  purchase-rates  of 
durable  goods  (Juster  1959).  One  usually  obtains  an  index-value  by  taking  the  mean  of 
several  component-values. 

1.2.1    National  and  Statewide  Business-Surveys 

Today,  several  indices  of  consumer  confidence  regularly  appear  in  a  variety  of 
business  publications  such  as  Business  Week  and  the  Wall  Street  Journal.  Three  widely 
publicized  national-measures  of  consumer  confidence  are  available  from  (1)  the  University 
of  Michigan,  (2)  the  Conference  Board,  and  (3)  ABC  News  and  Money  magazine.  A 
statewide  measure  is  the  Consumer  Confidence  Index  (CCI)  published  monthly  by  the 
Bureau  of  Economic  and  Business  Research  (BEBR)  at  the  University  of  Florida. 
Figure  1.1  shows  the  BEBR's  CCI  and  the  University  of  Michigan's  Index  of  Consumer 
Sentiment  (ICS)  from  the  first  quarter  of  1989  through  the  last  quarter  of  1992.  We  see 
that  both  measures  track  fairly  well  together  and  drop  from  high  to  low  levels.  The 
survey and the procedure for constructing the index employed by the BEBR are patterned
after the University of Michigan's national index. Thus, we may infer from Figure 1.1
that the confidence of Floridians is representative of the confidence of consumers
nationally. Figure 1.2 shows the index-component-values for the
future-financial-condition component, used in constructing each respective index. From
the figure, we observe that the two curves behave similarly during the recessionary
episode and have a high degree of correlation between them throughout the time period.
This information strengthens our previous conclusion concerning the representativeness
of Floridians. We use the BEBR's survey data over the time period from 1989 through
1992 for analysis in this research.

Figure 1.1 Measures of consumer confidence
[Line plot, titled Consumer Confidence: BEBR's CCI and Michigan's ICS by quarter, Jan '89 - Dec '92.]

Figure 1.2 Two measures of personal financial expectations
[Line plot, titled Future-Personal-Finances: BEBR's component and Michigan's component by quarter, Jan '89 - Dec '92.]

1.2.2 BEBR Survey Data


To construct its composite index, the BEBR uses five components. Three of these
are indicators of consumers' personal finances and buying plans, and the other two are
indicators of consumer expectations of the national economy. The survey questions for
these components have four alternative answers: 'better', 'same', 'worse', and 'don't
know', or 'good', 'uncertain', 'bad', and 'don't know'. Figure 1.3 shows the percentage
of answers given by respondents for the first two component questions. The figure shows
a curious change in the percentages of unsure respondents during the last three quarters
of 1990--there is a sudden change in the percentage of respondents answering 'don't
know'. This shift is very pronounced on the first component since the level of the curve
before and after the episode is around zero. The phenomenon appears to be longer lasting
for respondents who were unsure of their future financial condition. We see that for the
second component in Figure 1.3, the percentage of unsure respondents stayed above its
average level preceding the episode, until the second quarter of 1991. Also, for both
components during the episode, the percentage of unsure respondents seems to be
negatively correlated with both the percentage of respondents who felt that their financial
condition would remain unchanged, and those respondents expecting to be financially better
off. We can infer from this discussion that useful information related to the episode may
be gathered by examining factors related to respondents who were unsure of their current
and future financial conditions. We choose to focus on consumer expectations of personal
finances--and not buying plans or national expectations--to see how certain demographic
descriptions of consumer households changed over the time period. Figure 1.4 again
shows the percentages of respondents who were unsure of both their current and future
financial conditions from the first quarter of 1990 to the third quarter of 1991. The figure
shows a high degree of correlation between the two curves. Thus we can say that the
'don't know' category for these components may contain key 'bits' of information which
may be informative of the ensuing economic downturn. Incidentally, the
percentage of unsure respondents for the other three components shown in Figure 1.5 does
not undergo rapid changes in 1990. However, for this group of answers, we observe an
interesting pattern in the percentage-levels for respondents whose expectations were good
vs. those who had bad expectations. We see that this phenomenon appears both in the
episodal period and the following recovery period.

Figure 1.3 Consumer expectations of personal finances
[Two line plots, titled Current-Personal-Finances and Future-Personal-Finances: percentage of answers by quarter, Jan '89 - Dec '92, with curves for BETTER, SAME, WORSE, and DON'T KNOW.]

Figure 1.4 Percentages of financially unsure respondents
[Line plot, titled Financially Unsure Respondents: percentage by quarter, Jan '90 - Jul '91, with curves for Current-Personal and Future-Personal.]


Figure 1.5 Expectations of national conditions and buying plans
[Three line plots, titled U.S. 1-Year Condition, U.S. 5-Year Condition, and Six-Month-Household: percentage of answers by quarter, Jan '89 - Dec '92, with curves for GOOD, UNCERTAIN, BAD, and DON'T KNOW.]

1.2.3    Consumer  Confidence  Metrics 

Given  the  previous  discussion,  a  question  we  examine  in  this  dissertation  concerns 
the  current  practice  for  constructing  a  composite  index  such  as  BEBR's  CCI.  The 
respondents'  answers  are  qualitative.  This  data  must  be  quantified  for  inclusion  in  most 
quantitative  models  based  on  traditional  statistical  techniques,  and  common  practice  for 
quantifying  the  qualitative  data  employs  the  use  of  a  balance  score  (Katona  and  Mueller 
1956;  Juster  1959;  Didow  et  al.  1983;  Gianotti  1989;  Niemira  1992).  The  procedure 
works  as  follows.  First,  we  must  express  the  survey  results  as  the  percentages  of 
respondents choosing one of three possible alternatives: Better, Same, or Worse (or
Good, Uncertain, Bad). Let us denote the percentage of respondents answering 'Better'
(or 'Good') as $P^{+}$; the percentage of respondents answering 'Worse' (or 'Bad') as $P^{-}$; and
let $P^{=}$ represent the percentage of respondents answering 'Same' or 'Uncertain'. The
balance score is then $(P^{+} - P^{-})$, or, the difference between the percentages of two out of
three categories of answers to a set of questions. An index is usually constructed by
computing the balance score, and perhaps adding some constant and/or error term.
Niemira (1992) gives the forms used for computing the three national measures. They
are as follows: (1) the University of Michigan uses a balance plus 100, or, $(P^{+} - P^{-}) + 100$;
(2) assuming $P^{+} \neq P^{-}$, the Conference Board measure is given by $P^{+} / (P^{+} - P^{-})$;
and (3) the ABC News poll is just $(P^{+} - P^{-})$. We notice two shortfalls associated with
these approaches. First, a requirement for computing a balance score in this way is that
$P^{+}$, $P^{=}$, and $P^{-}$ must sum up to 100. Given that there are four categories of responses,
there  are  two  alternatives  for  satisfying  this  requirement.   The  first  is  that  one  can  simply 


discard the data for the 'don't know' category. We have shown that for the time period
we want to study, this category contains potentially useful information. A second
alternative is to group the 'don't know' category with one of the remaining ones for
computing the percentages. Figure 1.6 shows a re-calculated BEBR index associated with
each respective grouping of the 'don't know' category with the other three categories.
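The effect of the grouping choice can be sketched in a few lines of code. This is a minimal illustration, not the BEBR procedure: it assumes the Michigan-style form (balance plus 100) described above, and the function names and the response percentages are hypothetical.

```python
def balance_score(better: float, worse: float) -> float:
    """Balance score: percentage answering 'better' minus percentage answering 'worse'."""
    return better - worse


def michigan_style_index(better: float, worse: float) -> float:
    """Balance score plus 100 (the University of Michigan form quoted in the text)."""
    return balance_score(better, worse) + 100.0


def regroup_dont_know(shares: dict, target: str) -> dict:
    """Fold the 'dont_know' share into one of the three remaining categories."""
    if target not in ("better", "same", "worse"):
        raise ValueError(f"unknown target category: {target}")
    merged = {k: v for k, v in shares.items() if k != "dont_know"}
    merged[target] += shares["dont_know"]
    return merged


# Hypothetical percentages for one survey period (they sum to 100).
shares = {"better": 35.0, "same": 40.0, "worse": 15.0, "dont_know": 10.0}

for target in ("better", "same", "worse"):
    grouped = regroup_dont_know(shares, target)
    print(target, michigan_style_index(grouped["better"], grouped["worse"]))
```

With these invented numbers, grouping 'don't know' with 'worse' gives the lowest index and grouping it with 'better' gives the highest, while grouping it with 'same' leaves the balance score unchanged.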


[Figure 1.6 plot: Re-calculated Confidence Index, combining 'don't know' with other
categories. Vertical axis: index value, 50 to 120. Horizontal axis: quarter, Q1 1989
through Q4 1992. Legend: Better + Don't Know; Same + Don't Know; Worse + Don't Know.]

Figure  1.6    Indexes  determined  by  'don't  know'  combinations 


We observe from the figure that the re-calculated index-values are reasonably correlated,
and may be higher or lower, depending on the particular combination we chose.
Grouping  the  'don't  know'  category  with  the  'worse'  category,  for  example,  gives  index- 
values  that  are  never  higher  than  those  obtained  when  we  group  the  'don't  know'  category 
with the 'same' category. We also observe from Figure 1.6 that the vertical distance
between the curves before and during the episode is significantly larger than the
corresponding distances during the recovery period. Some researchers propose that the
'unsure'  respondents  may  resemble  the  pessimistic  replies  in  their  effect  (Katona  and 
Mueller  1956).  Didow  et  al.  (1983)  propose  that  consumer  unsureness  may  in  fact  have 
an  optimistic  or  pessimistic  connotation  with  respect  to  overall  confidence.  Common 
practice  for  computing  a  balance  score  groups  the  'don't  know'  responses  with  the  'same' 
responses.  From  a  previous  discussion,  we  saw  that  these  two  categories  appear  to  be 
negatively  correlated.  This  means  that  if  we  were  to  study  a  combined  group  for  them, 
then  any  key  'bits'  of  information  they  may  contain  individually  may  become  masked  in 
the  combined  group.  Hence,  we  use  neither  of  these  two  alternatives  and  instead  prefer 
to  study  each  category  separately. 

A  second  shortcoming  associated  with  balance  scores  is  that  they  completely 
ignore the percentage of respondents who feel the 'same', or $P^{=}$. Many researchers
debate this method because ignoring $P^{=}$ implies that different percentage weights can
result  in  the  same  score  (Didow  et  al.  1983;  Gianotti  1989).  By  choosing  to  examine 
each  of  the  four  categories  individually,  we  immediately  gain  some  advantages  over 
commonly used approaches. We do not undertake the task, in this work, of determining
an appropriate set of weights to use with the four categories for the purpose of
determining a consumer confidence metric. Didow et al. (1983) studied the problem of
finding a set of weights for the four categories (i.e., 'better', 'same', 'worse', 'don't know'),
and  give  results  from  using  an  alternating  least-squares  optimal  scaling  model  for 
evaluating  scales  based  on  mixed  metric  responses.  The  authors  used  data  from  two 
national  surveys  of  consumer  finances  conducted  by  the  University  of  Michigan's  Survey 
Research  Center  in  1971  and  1973.  Using  the  'PRINCIPALS'  (principal  components 
analysis  via  alternating  least  squares)  algorithm  in  their  study,  Didow  et  al.  (1983)  give 
empirical  results  having  inconsistencies  with  Michigan's  ICS  construction  procedure,  and 
demonstrate  the  potential  of  using  their  alternating  least-squares  optimal  scaling  model 
for  developing  a  better  measure  of  consumer  confidence. 
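
The second shortcoming noted above can be made concrete with a small arithmetic illustration. The two response distributions below are invented for illustration only; they differ substantially in the share of 'same' answers, yet a balance score cannot tell them apart.

```latex
\begin{align*}
(P^{+}, P^{=}, P^{-}) = (50, 30, 20) &\;\Rightarrow\; P^{+} - P^{-} = 50 - 20 = 30,\\
(P^{+}, P^{=}, P^{-}) = (40, 50, 10) &\;\Rightarrow\; P^{+} - P^{-} = 40 - 10 = 30.
\end{align*}
```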

In  this  research,  we  seek  useful  demographic  descriptions  for  the  four  categories 
representing  the  respondents'  expectations  of  their  future  financial  condition.  We  prefer 
descriptions  in  terms  of  such  attributes  as  age,  income,  and  party  affiliation.  Many 
researchers  commonly  use  demographics  as  independent  variables  in  their  analytical  and 
empirical models (Ketkar and Cho 1982; Wagner and Hanna 1983; Kent 1992; Morwitz
et  al.  1993;  Sawtelle  1993).  Next  to  income,  attributes  such  as  age,  marital  status, 
employment  status,  etc.,  can  considerably  influence  consumer-household  expenditures  for 
market  goods  and  services  (Wagner  and  Hanna  1983).  Thus,  an  hypothesis  we  propose 
is:  given  useful  demographic  descriptions  for  the  four  categories  of  responses 
representing the respondents' future financial condition, descriptions for the 'don't know'
category are unique, in that they are unlike any of the remaining three, to a certain
extent, and should not be combined with one of the remaining three to examine certain
changes taking place during the 1990-1991 recession. 'Useful' descriptions, in this
context, are descriptions produced by a credible reasoning procedure that are also easily
interpretable by humans.

1.3    Consumer  Expectations  and  Buying  Plans 

Economists  and  business  analysts  also  examine  data  associated  with  changes  in 
consumer  demand  for  automobiles  occurring  during  the  episode.  Hall  (1993)  suggests 
that consumers' unwillingness to buy automobiles was a significant factor associated with
the consumption shock. Perry and Schultze (1993) state that during the recovery period,
motor vehicle purchases were an area where consumer spending was substantially
over-predicted. A question in the BEBR survey asks whether anyone in the household plans
to buy a car or truck. Figure 1.7 shows the percentages of respondent-answers for the
alternative-answers of 'yes', 'maybe', and 'don't know'. From the figure we see that,
during  the  recessional  period,  the  percentage  of  'yes'  respondents  dropped  about  six 
percentage  points  before  leveling  off  about  five  points  lower.  A  similar  analysis  of  the 
percentage  of  'no'  respondents  showed  that  it  increased  by  ten  points  before  leveling  off 
at  a  level  five  points  higher.  From  these  observations,  and  given  our  previous  discussion 
concerning  information  for  the  'don't  know'  category,  a  plausible  hypothesis  is  that 
changes  taking  place  in  the  'don't  know'  category  of  consumer  expectations  for  personal 
finances,  buying  plans,  and  national  conditions,  best  explain  the  change  in  consumers' 
unwillingness  to  purchase  cars.   Since  this  category  is  not  considered  in  most  methods 


[Figure 1.7 plot: Car-Purchase-Plans, Jan '89 - Dec '92. Vertical axis: percentage of
respondents, 0 to 20. Horizontal axis: quarter, Q1 1989 through Q4 1992. Legend: YES;
MAYBE; DON'T KNOW.]


Figure  1.7    Consumer  car  purchase  plans 


for  computing  a  consumer  confidence  metric,  this  may  explain  why  consumer  spending 
was  over-predicted  for  motor  vehicle  purchases. 

In order to examine this premise, we require information about the respondents' plans
for buying a car in relation to their expectations for personal finances, buying plans,
and national conditions. Thus,
there  are  four  guiding  principles  for  this  research  on  how  to  obtain  the  information  we 
seek:  (1)  the  domain  knowledge  is  induced  from  the  observed  data;  (2)  the  knowledge  is 
obtained  using  a  credible  reasoning  mechanism;  (3)  a  decision  theoretic  approach  is  used 


to form descriptions; and (4) these descriptions are easily interpretable by humans. A
working premise of this dissertation is that we can successfully use Artificial Intelligence
(AI) methods and techniques to produce the information we require. Researchers in many
different research communities employ different perspectives and methods for using AI.
The next section describes approaches used in our research.

1.4   AI  and  Feature  Construction 

AI-based knowledge-acquisition procedures usually examine examples of solved
cases and give general decision rules in terms of a pre-defined structure. Knowledge
acquisition is commonly referred to in the field of machine learning as concept learning,
where the fundamental goal is, for example, to extract descriptions that best describe the
sample data. Weiss and Kulikowski (1991) make three key points of interest: (1) machine
learning methods can give solutions in formats easily understood and more compatible
with human reasoning; (2) from the perspective of minimizing average error rates,
learning systems can be viewed as attempts to approximate Bayes rule; and (3) decision
trees are currently the most highly developed machine-learning technique for partitioning
samples into a set of covering decision rules.

The ID3 algorithm developed by J. R. Quinlan is an extensively studied technique
for inducing a decision tree from a set of examples. Quinlan (1990a) states:

Decision trees provide a powerful formalism for representing
comprehensible, accurate classifiers. The top-down method of constructing them
is computationally undemanding. When they are used, information regarding
attribute values is sought only as required, making them attractive in a diagnostic
context. They have also been found useful as components of intelligent systems...
(345)

We use decision trees as a general model for the learning system we develop. A decision
tree is a structure consisting of nodes and branches where each node represents a test or
decision.  There  is  a  branch  attached  to  the  node  for  every  possible  outcome  of  the  test. 
Thus,  performing  the  test  gives  a  partition  of  two  or  more  disjoint  sets  covering  the 
outcomes  of  the  test.  The  tree  branches  to  another  node,  according  to  the  outcome  of  the 
test,  until  a  leaf  or  terminal  node  is  reached.  The  terminal  leaves  correspond  to  sets  of 
the  same  class  or  category. 
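
The structure just described can be sketched as a small data type together with a classification routine. This is a minimal illustration under our own assumptions: the class names, the attribute names ('employed', 'income'), and the category labels are hypothetical and are not taken from the BEBR data.

```python
from dataclasses import dataclass, field
from typing import Dict, Union


@dataclass
class Leaf:
    """Terminal node: holds the class (category) assigned to examples reaching it."""
    label: str


@dataclass
class Node:
    """Internal node: tests one attribute and has one branch per test outcome."""
    attribute: str
    branches: Dict[object, "Tree"] = field(default_factory=dict)


Tree = Union[Leaf, Node]


def classify(tree: Tree, example: dict) -> str:
    """Follow the branch matching each test outcome until a leaf is reached."""
    while isinstance(tree, Node):
        tree = tree.branches[example[tree.attribute]]
    return tree.label


# Hypothetical two-level tree over two made-up survey attributes.
tree = Node("employed", {
    "no": Leaf("worse"),
    "yes": Node("income", {"low": Leaf("same"), "high": Leaf("better")}),
})

print(classify(tree, {"employed": "yes", "income": "high"}))
```

Each internal node partitions the examples reaching it by the outcome of its test, so every example follows exactly one path from the root to a leaf.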

Quinlan  (1993)  gives  two  shortcomings  related  to  using  decision  trees:  (1)  they 
can  be  cumbersome,  complex,  and  inscrutable  due  to  the  specific  context  established  by 
the outcomes of tests at antecedent nodes; and (2) the structure of the tree may cause
individual subconcepts to be fragmented, or to appear twice or more in the tree, in a way
that makes the tree harder to interpret. Quinlan (1993) also gives two ways to avoid this
'replication' problem: create more task-specific attributes, or use a different structure for
representing the knowledge (i.e., production rules).

Pagallo  (1990)  develops  an  algorithm  that  creates  more  task-specific  attributes  in 
a  way  such  that  decision  trees  constructed  using  these  attributes  avoid  the  replication 
problem.  The  replication  problem  is  a  representation  shortcoming  where  duplications  of 
decision-sequences,  or  patterns,  exist  for  determining  truth  settings  (Pagallo  1990).  The 
idea  of  creating  new  task-specific  attributes  is  commonly  referred  to  as  feature 
construction.  Feature  construction  is  a  technique  for  creating  new  features  which  are 
combinations  of  the  existing  attributes.  This  is  a  type  of  representation  change,  where 
each  term,  or  feature,  in  the  concept  description  is  a  function  of  the  initial  prime 


attributes. A major area of focus in this research seeks to improve feature construction.
In our research, we form features that are conjuncts of attributes.

Feature construction algorithms use an empirical learning method to construct new
features from the initial prime attributes. Pagallo (1990) developed an approach,
FRINGE, which was well received by many researchers. The heart of the FRINGE
algorithm for learning DNF concepts works by combining attributes along the positive
paths of a decision tree (Pagallo 1990; Pagallo and Haussler 1990). One drawback to
constructing features in this way is that the learning algorithm requires a long time to
process the examples. FRINGE works by forming features from a given decision tree and
uses the existing and new features to build a decision tree for the next iteration of the
algorithm. Testing large numbers of features when constructing the decision tree
increases the time of each iteration, and this significantly raises the total running time of
the algorithm. Adding to the dilemma, Pagallo (1990) reports that in one
experiment, only 10% of the total number of new features were actually used by the
algorithm. She suggests two ways to improve the computational efficiency of FRINGE.
The first is, at each iteration, to remove the features that are not useful to the learning
task from the feature set. Pagallo (1990) focuses on using this approach and refers to it
as feature pruning. A second approach is to limit the number of new features included
in the feature-set used to construct the decision tree. This dissertation focuses on this
latter approach.
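
The idea of combining attributes along positive paths can be sketched as follows. This is a minimal illustration and not Pagallo's implementation. It assumes a binary tree encoded as nested tuples (attribute, subtree-if-false, subtree-if-true) with boolean leaves, and it assumes that each new feature is the conjunction of the last two tests on the path to a positive leaf, which is our reading of FRINGE in Pagallo and Haussler (1990). The function name and the example tree are hypothetical.

```python
def fringe_features(tree, path=()):
    """Collect one candidate feature per positive leaf at depth >= 2.

    A feature is a frozenset of (attribute, required_value) literals formed
    from the last two tests on the path from the root to the positive leaf.
    """
    if isinstance(tree, bool):                 # leaf: True = positive, False = negative
        if tree and len(path) >= 2:
            return {frozenset(path[-2:])}
        return set()
    attribute, on_false, on_true = tree        # internal node
    return (fringe_features(on_false, path + ((attribute, False),))
            | fringe_features(on_true, path + ((attribute, True),)))


# Hypothetical tree for the concept (x1 and x2) or (x3 and x4).
# The subtree testing x3 and x4 appears twice, i.e. the replication problem.
t34 = ("x3", False, ("x4", False, True))
tree = ("x1", t34, ("x2", t34, True))

for feature in sorted(fringe_features(tree), key=sorted):
    print(sorted(feature))
```

In this small example the two replicated subtrees yield the same feature, so only two distinct candidate features are produced, one for each term of the concept.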



1.4.1    Time-Complexity  of  Feature  Construction 

In this work we focus on the time complexity of feature-construction learning
algorithms using decision trees as a concept description language. We also examine the
ranks  of  decision  trees.  The  rank  of  a  binary  decision  tree  is  an  indicator  of  the 
conciseness  of  the  tree.  Note  that  a  binary  decision  tree  is  an  approximate  representation 
of  a  target  concept.  The  tree  is  concise  if,  for  example,  each  decision  variable  in  the  tree 
is  a  term  in  the  target  concept.  We  regard  a  decision  tree  having  a  rank  of  one  as  being 
(1)  a  concise  representation  of  a  target  concept,  and  (2)  devoid  of  the  'replication' 
problem.  Trees  of  higher  ranks  reflect  a  more  complex  representation  for  a  given 
concept.  We  prefer  concise  representations,  thus,  a  general  goal  for  forming  features  is 
that  we  form  features  that  aid  in  reducing  the  ranks  of  decision  trees  for  a  target  concept. 
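
The rank of a binary decision tree can be computed recursively. The text above does not restate the definition, so the sketch below assumes the definition given by Ehrenfeucht and Haussler (1989): a leaf has rank 0, and a node whose subtrees have ranks r0 and r1 has rank max(r0, r1) if r0 and r1 differ, and r0 + 1 if they are equal. The tuple encoding and the example trees are hypothetical.

```python
def rank(tree) -> int:
    """Rank of a binary decision tree.

    A tree is either a leaf (any non-tuple value, e.g. a class label) or a
    tuple (attribute, subtree_for_0, subtree_for_1).
    """
    if not isinstance(tree, tuple):
        return 0                                # a single leaf has rank 0
    _, subtree0, subtree1 = tree
    r0, r1 = rank(subtree0), rank(subtree1)
    return max(r0, r1) if r0 != r1 else r0 + 1


# A chain-shaped tree (every node has at least one leaf child) has rank 1.
chain = ("x1", 0, ("x2", 0, ("x3", 0, 1)))

# A complete binary tree of depth 2 has rank 2.
complete = ("x1", ("x2", 0, 1), ("x2", 1, 0))

print(rank(chain), rank(complete))
```

Under this definition, rank-one trees are exactly the chain-shaped trees, which is consistent with treating a rank of one as the concise case.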

Ehrenfeucht and Haussler (1989) give a polynomial learning algorithm for Boolean
trees, FIND(S, r), that, when given a sample S of a Boolean function over n Boolean
variables, produces a bounded-rank decision tree of rank r that is consistent with S, or
fails. Ehrenfeucht and Haussler (1989) state:

We define the rank of a decision tree and exhibit a learning algorithm that for any
target function f represented by a decision tree of rank at most r on n Boolean
variables, and any distribution P on $\{0,1\}^n$, produces with probability at least
$1 - \delta$, a hypothesis (represented as a decision tree of rank at most r) that has error at
most $\epsilon$. For any fixed rank r, the number of random examples and computation
time required for this algorithm is polynomial in n and linear in $1/\epsilon$ and $\log(1/\delta)$.
(232)

The  time  complexity  of  the  algorithm  is: 


Lemma 3 [Time Complexity of FIND(S, r), Ehrenfeucht and Haussler (1989)].
For any nonempty sample S of a function on $X_n$ and $r \geq 0$, the time of FIND(S, r)
is $O(|S|(n+1)^{2r})$. (238)

This is a key result, and the analytical model we develop in this research extends
this result by presenting a model of the time complexity of an algorithm as we add new
features. We propose adding j new features constructed using time FC(j) such that:

$$(n + 1)^{2r} > (n + 1 + j)^{2(r-i)} + FC(j)$$

where we get a new decision tree of rank (r - i).

For  our  analysis,  we  separate  the  complexity  of  our  model  into  two  factors:  (1)  the 
tree-construction-factor, or $(n+1+j)^{2(r-i)}$, and (2) the feature-construction-factor, or FC(j).
Ideally,  we  desire  that  the  sum  of  the  individual  factors  reduce  the  order  of  the  time 
complexity  resulting  from  simply  building  a  decision  tree  using  the  initial  prime 
attributes.  Considering  the  tree-construction-factor,  we  use  Taylor's  Theorem  to  develop 
a  model  allowing  us  to  determine  the  maximum  number  of  new  features  we  can  add  to 
an  existing  set  of  features,  such  that  the  order  of  the  time  complexity  for  building  a  new 
tree  with  this  new  set  of  features  is  no  greater  than  the  order  of  the  time  complexity 
resulting  from  using  the  initial  features  to  build  the  given  tree.  A  key  assumption  in  our 
model  is  that  the  new  features  we  create  will  in  fact  be  used  by  the  tree-construction 
heuristics,  giving  a  new  decision  tree  of  lesser  rank.  Thus,  another  hypothesis  proposed 
in  this  dissertation  is  as  follows:  we  can  build  a  decision  tree  within  the  standard  time 
complexity  by  adding  at  most  j  new  and  useful  features  to  the  existing  set  of  features. 


Useful features are features that are likely to be selected by the heuristics used to
construct the decision tree.
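
The inequality $(n+1)^{2r} > (n+1+j)^{2(r-i)} + FC(j)$ can be checked numerically for small cases. The sketch below is a direct search and not the Taylor's Theorem model developed in this work. The function name, the default of a zero feature-construction cost, and the example cost function are assumptions made for illustration. When FC(j) = 0, the condition reduces to $j < (n+1)^{r/(r-i)} - (n+1)$.

```python
def max_new_features(n: int, r: int, i: int, fc=lambda j: 0) -> int:
    """Largest j for which (n+1)^(2r) > (n+1+j)^(2(r-i)) + fc(j).

    n  : number of initial boolean attributes
    r  : rank of the tree built from the initial attributes
    i  : assumed reduction in rank after adding the new features (1 <= i < r)
    fc : hypothetical cost of constructing j features, assumed non-decreasing in j
    """
    if not 1 <= i < r:
        raise ValueError("need 1 <= i < r so that the new rank r - i is positive")
    bound = (n + 1) ** (2 * r)

    def fits(j: int) -> bool:
        return bound > (n + 1 + j) ** (2 * (r - i)) + fc(j)

    j = 0
    while fits(j + 1):       # exact integer arithmetic, so no rounding issues
        j += 1
    return j


# Five attributes, initial rank 2, new rank 1, free feature construction.
print(max_new_features(5, 2, 1))
# Same setting with a made-up linear construction cost of 10 units per feature.
print(max_new_features(5, 2, 1, fc=lambda j: 10 * j))
```

The factor |S| is common to both sides of the comparison and is therefore omitted, as in the inequality above.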

1.4.2    DUALTREE  Feature  Construction 

Our  analysis  hinges  on  being  able  to  create  'useful'  features.  Indeed,  a  key 
premise  is  that  we  are  able  to  construct  decision  trees  of  smaller  ranks.  If  we  cannot 
construct  such  trees  after  creating  a  limited  number  of  features,  then  the  features  may  not 
be  'useful'  for  our  purposes.  Considering  this  problem,  this  dissertation  also  focuses  on 
developing  a  procedure  to  construct  minimal  feature-sets  in  a  computationally  efficient 
way.  Furthermore,  the  procedure  must  preserve  the  basic  structure  of  the  subsets 
represented  by  the  internal  nodes  of  a  decision  tree.  We  propose  such  a  procedure  based 
on  Beckman's  (1980)  method  for  finding  an  automaton  with  the  smallest  number  of  states 
that  accepts  precisely  the  same  set  of  tapes  as  a  given  non-deterministic  finite  automaton. 
The author adopts Webster's definition of an automaton, which is "a machine or control
mechanism designed to follow automatically a predetermined sequence of operations or
respond to encoded instructions." Our procedure combines certain concepts from the
theories of sets and categories with Beckman's method and represents a different approach
to feature construction.

A  decision  tree  constructed  using  boolean  attributes  characterizes  a  classification 
procedure  that  associates  a  unique  leaf  of  the  tree  with  any  object,  even  one  not  in  the 
original training set. Further, the leaves of the tree partition the space of objects into
disjoint categories (Quinlan and Rivest 1989). One appealing aspect of category theory
is that we can construct universal descriptions using category-theoretic terms (Blass 1984).
Examples of concepts with universal descriptions include the set of natural numbers,
power set, cartesian product, the logical connectives and quantifiers. The
category-theoretic framework captures the important structural properties of these descriptions since
we view objects in the category as generalized sets (Blass 1984). A key idea used in our
work  relates  to  the  concept  of  'duality'  and  its  meaning  for  categories.  Forming  the  dual 
of  a  category  amounts  to  keeping  the  same  objects  but  twisting  the  structure  of  the  links 
between  the  objects.  For  example,  when  the  links  between  the  objects  are  'directed' 
links,  we  'twist'  the  structure  by  simply  'reversing'  the  directions  of  all  links.  We  use 
this  concept  to  develop  our  idea  of  the  dual  of  a  decision  tree.  Krishnan  (1981)  presents 
the  following  view  of  the  principle  of  duality  for  categories: 

Usually  we  visualize  a  category  as  a  class  of  points  for  the  objects,  and  a 
class  of  arrows  for  the  morphisms,  each  arrow  going  from  the  point  that  is  its 
domain  to  the  point  that  is  its  codomain.  For  finite  categories  these  diagrams  can 
be  drawn  on  paper.  The  dual  of  a  category  is  then  pictured  as  one  with  all  the 
names  for  the  objects  as  well  as  morphisms  unchanged  and  only  the  direction  of 
each  arrow  reversed.    (39) 

We refer to our procedure for forming new features as DUALTREE, since it works by
constructing the dual of a decision tree. After obtaining the dual of a decision tree, the
DUALTREE procedure forms new features in a manner commonly called 'subset
construction' in the literature on automata theory. We refer to it as 'feature construction'
in the sequel.

We next illustrate our procedure using a simple decision tree. First we describe
how to form the dual of a decision tree, and then show how DUALTREE forms new
features. We propose that forming new features in this way results in having a feature-set
of lower cardinality to use for building a decision tree. Finally, we show results from
using our procedure and an initial decision tree for a subset of the BEBR sample data we
analyze in this work.

1.4.2.1    Dual  decision  trees 

To show how to form the dual of a tree, assume we are given a bounded-rank
decision tree of rank r, constructed using n boolean attributes or features. Now, we make
the following assignments. Let the subsets determined by the internal nodes along a path
from the root node to a leaf represent the objects for a category. Let the edges of the
path represent the morphisms for the category. A path, which is a set of vertices and
edges, consists of predecessor and successor vertices where the directions of the edges
'point' to the successor vertex. Reversing the directions of these morphisms amounts to
saying that, for a given path, each edge will point to the predecessor vertex if the order
in which the vertices are listed remains unchanged. We form the dual of a tree using this
idea of 'reversing' the directions. Figure 1.8 shows both a decision tree (constructed
using five boolean attributes) and its dual. For the dual tree, only the directions of the
edges differ while the orientation of the nodes and the node-labels remain the same.
Also, references for several of the nodes in the tree have changed. Using this bottom-up
approach for constructing the dual tree, we may acquire several root nodes but only one
terminal node, or vice versa, depending on the structure of the initial tree. With the
exception of the sole terminal node, each node in the dual tree has exactly one child and
zero or two parent nodes. Note also for this example that the decision tree includes the
'replication problem'. This representational complexity stems from the duplication of


[Figure 1.8 panels: 'A decision tree' and 'The dual tree'.]

Figure  1.8   A  decision  tree  and  its  dual 


tests in the tree to determine if an instance satisfies a term. Hence, our tree is not a
concise decision tree.
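
Reversing the direction of every edge can be sketched with edges stored as triples. This is a minimal illustration under our own assumptions: tree edges are written (parent, test value, child), nodes are identified by their labels (so leaves with the same class label are treated as one node), and the small tree used here is hypothetical rather than the tree of Figure 1.8.

```python
def dual_edges(edges):
    """Reverse every edge: (parent, value, child) becomes (child, value, parent)."""
    return {(child, value, parent) for (parent, value, child) in edges}


def roots(edges):
    """Nodes that have outgoing edges but no incoming edge."""
    sources = {p for (p, _, _) in edges}
    targets = {s for (_, _, s) in edges}
    return sources - targets


def terminals(edges):
    """Nodes that have incoming edges but no outgoing edge."""
    sources = {p for (p, _, _) in edges}
    targets = {s for (_, _, s) in edges}
    return targets - sources


# Hypothetical tree for the concept (x1 and x2); '0' and '1' are the class leaves.
tree_edges = {
    ("x1", 0, "0"), ("x1", 1, "x2"),
    ("x2", 0, "0"), ("x2", 1, "1"),
}

dual = dual_edges(tree_edges)
print(sorted(roots(dual)), sorted(terminals(dual)))
```

In the dual, the class leaves become the root nodes and the original root becomes the sole terminal node; applying the reversal twice returns the original edge set.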

We show the dual tree redrawn in Figure 1.9 so that the root nodes are at the top.
In Figure 1.9 we see that every node, except the roots, has exactly two parents. Also,
considering the two edges pointing to a node, the '0-edge' comes from the left and the




Figure  1.9   The  re-oriented  dual  tree 


'1-edge' comes from the right. For now, we adopt this rule for ordering the two parents
of a node. For DUALTREE feature-construction, we make the following set assignments
for R, the set of root nodes, and E, the set of edges, using the dual tree:

(1) $R = \{0, 1\}$

(2) $E = \{(0,0,x_3), (0,0,x_5), (0,1,x_4), (x_3,0,x_2), (x_5,0,x_4), (x_4,1,x_3), (1,1,x_5), (1,1,x_2), (x_3,0,x_1), (x_2,1,x_1)\}$

The set E consists of triples denoting the edges in the dual tree. The first and third
elements of the triple represent the predecessor and successor nodes, respectively. The
second element of the triple is the value of the test-function that this edge, or triple,
corresponds to. For example, the first element of E, $(0,0,x_3)$, represents the left-most edge
in the lower half of Figure 1.9. Our triple informs us that if we are at node '0' and the
value of the test function is '0', then, for this instance, we go to node $x_3$. The cardinality
of E is 10 and the last two edges listed are the terminal edges of the tree (i.e., the
successor is a terminal node).

Continuing  with  the  construction,  we  create  a  new  tree  having  a  root  node  for  each 
element  of  R  by  constructing  all  possible  successor  nodes  for  the  roots.  Let  the  successor 
node  be  the  set  of  elements  that  are  possible  successors  of  any  element  of  the  predecessor 
node,  for  a  given  value  of  the  function.  Note  that  predecessor  nodes  may  themselves  be 
a  'set'  of  elements.  So,  for  example,  we  create  the  successor  nodes  for  the  root  nodes  of 
'0'  and  '1'  with  the  following  steps.  First,  we  form  the  successor  for  the  first  element 
of  R,  using  a  function-value  of  0.  Upon  examining  the  edges  in  E,  we  find  two  edges 
beginning  with  '0'  that  have  a  function-value  of  0.    The  successor  node  is  the  union  of 


the ending node for each of these two edges, or $x_3$ and $x_5$. Thus, given a function-value
of 0, the successor for '0' is $\{x_3, x_5\}$. For a function-value of 1, the successor of '0' is
just $\{x_4\}$. Continuing in this fashion we find, for example, that given a function-value
of 1, the successor of '1' is $\{x_5, x_2\}$. The successor for any (predecessor, function-value)
pair not in E is defined to be the null node. Figure 1.10 shows the features formed after
constructing all successors for the dual tree. Observe that negated attributes are not
included in any of the features. Figure 1.11 shows a dual tree equivalent to our initial
dual tree.
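
The successor computation described above can be sketched directly from the edge set. This is a minimal illustration of a generic subset construction over the triples in E, not the DUALTREE implementation itself. The function names are hypothetical, the empty set stands in for the null node, and the sixth triple of E, which is garbled in the source listing, is read here as (x4, 1, x3) on the assumption that this is the only edge consistent with the path counts reported for the tree.

```python
# Edge triples (predecessor, function-value, successor) of the dual tree.
# ("x4", 1, "x3") is an assumed reading of a triple that is unreadable in the source.
E = {
    ("0", 0, "x3"), ("0", 0, "x5"), ("0", 1, "x4"),
    ("x3", 0, "x2"), ("x5", 0, "x4"), ("x4", 1, "x3"),
    ("1", 1, "x5"), ("1", 1, "x2"),
    ("x3", 0, "x1"), ("x2", 1, "x1"),
}
R = ["0", "1"]


def successor(nodes, value, edges):
    """Union of the successors of every element of `nodes` for a given function-value."""
    return frozenset(s for (p, v, s) in edges if p in nodes and v == value)


def subset_construction(root_nodes, edges, values=(0, 1)):
    """Build all successor sets reachable from the roots; empty set = null node."""
    start = [frozenset({r}) for r in root_nodes]
    seen, todo, table = set(start), list(start), {}
    while todo:
        state = todo.pop()
        for v in values:
            nxt = successor(state, v, edges)
            table[(state, v)] = nxt
            if nxt and nxt not in seen:     # null nodes are recorded but not expanded
                seen.add(nxt)
                todo.append(nxt)
    return table


table = subset_construction(R, E)
print(sorted(table[(frozenset({"0"}), 0)]))
print(sorted(table[(frozenset({"1"}), 1)]))
```

The first two successor sets match the ones worked out in the text for the roots '0' and '1'; the remaining entries of the table depend on the assumed triple and should be read with that caveat.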

Next,  we  construct  the  dual  of  the  dual-tree  in  Figure  1.11.  Performing  this  action 
produces  a  tree  once  again  having  features  for  the  roots  and  classes  as  leaves,  and  this 
orientation  allows  us  to  interpret  the  tree  as  we  are  normally  accustomed.  Figure  1.12 
shows that the dual of the dual tree has six terminal paths to a negative leaf, and three
terminal  paths  to  a  positive  leaf.  These  are  the  same  totals  as  the  respective  paths  in  the 
initial  decision  tree. 

The  final  step  for  the  procedure  is  to  perform  feature  construction  on  the  dual  of 
the  dual  tree  to  again  reduce  the  inherent  structure  of  the  tree.  Using  our  procedure,  we 
form  six  new  features  that  are  conjuncts  of  the  initial  attributes.   The  new  features  are: 

XjX^  X  jX2  ^2^5 

Thus,  we  add  six  new  features  to  the  set  of  five  attributes  resulting  in  a  feature-set 
containing eleven elements. Finally, using our new feature-set, Figure 1.13 shows several
equivalent  trees  for  the  initial  decision  tree,  also  having  a  bounded  rank  equal  to  one. 
This  completes  the  illustration  of  our  procedure.    Before  showing  a  decision  tree  for  the 


[Figure 1.10 diagram. Legend: a blank space represents the null node; separate markers
denote a terminal node and a root node.]

Figure  1.10  Forming  features  in  the  dual  tree 


sample  data  we  study,  we  first  discuss  how  we  acquire  the  initial  decision  tree  used  by 
DUALTREE  to  form  new  features. 

1.4.3    ID3/C4.5  Decision  Trees  Using  BEBR  Sample  Data 


A  standard  decision  tree  algorithm  creates  an  hypothesis  by  recursively  selecting 
which  attribute  to  place  at  a  node  and  partitioning  the  set  of  examples  according  to  their 
values  for  the  test  attribute.  The  successive  divisions  of  the  set  of  examples  proceed  until 


[Figure 1.11 diagram. Null nodes are not shown.]

Figure 1.11   A dual tree after feature construction


all  the  subsets  consist  of  cases  belonging  to  a  single  class,  or,  the  subsets  satisfy  some 
other  terminating  condition.  Decision  tree  algorithms  using  a  greedy  method  are 
nonbacktracking since once a test has been selected to partition the current set of
examples,  the  choice  is  irreversible  and  other  attributes  or  features  are  not  considered. 
Also,  a  common  goal  of  greedy  algorithms  is  not  to  infer  more  structure  for  the  target 
concept  than  is  justified  by  the  set  of  examples.  In  other  words,  we  prefer  not  to 
construct  very  complex  trees  that  'overfit'  the  data.    Our  purpose  here  is  not  to  discuss 




Figure  1.12  The  dual  of  the  dual  tree 


the  theory  of  decision  trees  since  it  is  well  documented  in  the  literature.    Instead,  we 
focus  on  several  of  the  major  concepts  of  interest. 

A  key  step  in  building  a  decision  tree  is  choosing  the  best  attribute  for  a  node  so 
that  the  test  at  the  node  gives  good  partitions.  When  one  chooses  the  best  attribute  based 
on  how  well  the  available  attributes  separate  the  classes  or  categories,  Breiman  et  al. 
(1984) refer to this as 'goodness of split'. Using the number of leaves in a tree as a
measure of its size, Mingers (1989) provides an empirical comparison of several


Figure 1.13 Decision trees with rank equal to 1


'goodness of split' measures and shows that the choice of a measure affects the size of
a tree but not its accuracy, which remains essentially the same even when attributes are
selected at random. Note that the number of leaves in a tree corresponds to the number
of distinct 'rules' contained within the decision tree. Another way to measure the
size of a tree is by counting the number of nodes in the tree. For our purposes, we prefer
smaller  trees  since  large  trees  may  be  too  complex  for  humans  to  easily  understand. 
Mingers'  (1989)  tests  show  that  the  'gain-ratio  measure',  developed  by  Quinlan  (1986), 


generally leads to 'smaller' trees. The gain-ratio criterion chooses the test which
maximizes the proportion of information generated by the split that appears helpful for
classification,  subject  to  the  constraint  of  also  giving  high  information  gain  (Quinlan 
1993). 

Quinlan  (1986,  1993)  proposed  an  evaluation  function  based  on  a  formula  from 
information  theory  that  measures  the  theoretical  information  content  of  a  code.  The  value 
of  this  measure  depends  on  the  likelihood  of  the  various  possible  messages.  If  they  are 
equally  likely,  then  we  have  a  case  representing  the  greatest  amount  of  uncertainty  and 
the  information  gained  will  be  the  greatest.  The  less  equal  the  probabilities,  the  less 
information  there  is  to  gain.  The  information-based  method  is  based  on  two  assumptions. 
Using Quinlan's (1986) notation, let C be a collection of p objects of class P and n
objects  of  class  N.    Quinlan's  (1986)  assumptions  are: 

(1) Any correct decision tree for C will classify objects in the same proportion
as their representation in C. An arbitrary object will be determined to
belong to class P with probability p/(p+n) and to class N with probability
n/(p+n).

(2)  When  a  decision  tree  is  used  to  classify  an  object,  it  returns  a  class.  A 
decision  tree  can  thus  be  regarded  as  a  classification  'P'  or  'N',  with  the 
expected  information  needed  to  classify  an  object  given  by 

$$I(p,n) = -\frac{p}{p+n}\log_2\frac{p}{p+n} - \frac{n}{p+n}\log_2\frac{n}{p+n}$$

Now, suppose attribute A is used as the root for a decision tree over C. This tree
partitions C into a certain number, v, of smaller collections, each denoted by C_i. Let each C_i
contain p_i objects of class P and n_i objects of class N. The expected information required
for the subtree for C_i is I(p_i, n_i). The expected information required for the tree with A
as its root is then determined by the weighted average given by

$$E(A) = \sum_{i=1}^{v} \frac{p_i + n_i}{p + n}\, I(p_i, n_i)$$

where the weight for the ith branch is the proportion of the objects in C that belong to
C_i. Thus, the information gained by branching on A is determined by

gain(A) = I(p,n) - E(A).
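
As a worked illustration of these quantities, the following Python sketch computes I(p,n), E(A), and gain(A) for a small two-class collection. The function names and the counts are assumptions chosen for illustration; they are not BEBR data.

```python
from math import log2


def info(p, n):
    """I(p, n): expected information, in bits, for p objects of P and n of N."""
    total = p + n
    result = 0.0
    for count in (p, n):
        if count > 0:  # a term with zero count contributes nothing
            frac = count / total
            result -= frac * log2(frac)
    return result


def expected_info(partition):
    """E(A): weighted average of I(p_i, n_i) over the subsets C_i.

    partition is a list of (p_i, n_i) pairs, one per value of attribute A.
    """
    p = sum(pi for pi, _ in partition)
    n = sum(ni for _, ni in partition)
    return sum((pi + ni) / (p + n) * info(pi, ni) for pi, ni in partition)


def gain(partition):
    """gain(A) = I(p, n) - E(A)."""
    p = sum(pi for pi, _ in partition)
    n = sum(ni for _, ni in partition)
    return info(p, n) - expected_info(partition)


# Illustrative counts: 9 objects of class P and 5 of class N,
# split three ways by a hypothetical attribute A.
three_way = [(2, 3), (4, 0), (3, 2)]
print(round(info(9, 5), 3))               # about 0.940 bits
print(round(expected_info(three_way), 3)) # about 0.694 bits
print(round(gain(three_way), 3))          # about 0.247 bits
```
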
A  drawback  to  using  this  measure  is  that  it  has  a  strong  bias  in  favor  of  tests  with 
many  outcomes.  Quinlan  (1993)  rectifies  this  bias  by  using  a  type  of  normalization  in 
which  the  apparent  gain  attributable  to  tests  with  many  outcomes  is  adjusted.  Hence,  the 
gain-ratio  measure  is  a  variant  of  the  information-gain  measure  that  incorporates  the  idea 
that  an  attribute  itself  can  have  some  information  value.  The  amount  of  information  value 
for  an  attribute  depends  on  the  distribution  of  examples  among  the  attribute's  possible 
values.  The  less  evenly  spread  its  values,  the  less  information  in  the  attribute.  Noting 
that  an  efficient  measure  should  convert  as  much  as  possible  of  the  attribute's  information 
value  into  the  classification  procedure,  Quinlan  (1993)  computes  the  ratio  of  the  gain  in 
information from using the attribute, A, to the information value of the attribute itself.


Thus, a gain-ratio is given by

$$\text{gain-ratio}(A) = \frac{I(p,n) - E(A)}{-\sum_{i=1}^{v} \frac{p_i + n_i}{p + n}\log_2\frac{p_i + n_i}{p + n}}$$

The  value  in  the  denominator  has  a  high  score  if  the  examples  are  spread  evenly  between 
the  attribute  values  and  a  low  one  if  they  are  not.  Thus,  we  can  say,  for  example,  that 
the  gain-ratio  measure  favors  attributes  with  a  small  number  of  values. 
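
The effect of the normalization can be seen in a small numerical sketch. The Python code below is an illustration under stated assumptions: the function names and counts are hypothetical, and the many-valued attribute is an artificial one that places every object in its own subset.

```python
from math import log2


def entropy(counts):
    """Entropy, in bits, of a list of non-negative counts."""
    total = sum(counts)
    return -sum(c / total * log2(c / total) for c in counts if c > 0)


def gain_and_ratio(partition):
    """Return (gain, gain-ratio) for a list of (p_i, n_i) pairs."""
    p = sum(pi for pi, _ in partition)
    n = sum(ni for _, ni in partition)
    sizes = [pi + ni for pi, ni in partition]
    e_a = sum(s / (p + n) * entropy([pi, ni])
              for s, (pi, ni) in zip(sizes, partition))
    gain = entropy([p, n]) - e_a
    split_info = entropy(sizes)  # information value of the attribute itself
    return gain, (gain / split_info if split_info > 0 else 0.0)


# Hypothetical collection: 9 objects of class P and 5 of class N.
binary = [(7, 0), (2, 5)]                  # attribute with two values
many_valued = [(1, 0)] * 9 + [(0, 1)] * 5  # one value per object

g_bin, r_bin = gain_and_ratio(binary)
g_many, r_many = gain_and_ratio(many_valued)
print(round(g_bin, 3), round(r_bin, 3))
print(round(g_many, 3), round(r_many, 3))
```

In this example the many-valued attribute has the larger information gain, but the binary attribute has the larger gain ratio, because the denominator grows with the number of evenly populated values.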

Quinlan's C4.5 program, an elaboration of ID3, uses the gain-ratio measure to
build decision trees. This algorithm is well studied and many researchers cite favorable
results from its use. The 'information' or 'entropy' based heuristics used by the
procedure generally produce simpler decision trees, especially when the sample sizes are
small and/or there are many different outcomes for the possible tests. This dissertation
does not focus on the heuristics used for constructing decision trees. However, given that
(1) the problem of finding a decision tree with the minimum expected number of tests is
NP-complete (Hyafil and Rivest 1976), and (2) a vast amount of literature suggests that
ID3/C4.5 can be a useful component of an intelligent system since its decision-tree
heuristics are supported by many theoretical arguments, we can say that
the ID3/C4.5 program represents a reasonable and practical approach for extracting
symbolic information from a set of examples. Because humans interpret symbolic
knowledge more readily than they do a collection of numbers (as in statistical classifiers
and neural networks), we infer from this discussion that the C4.5 program (Quinlan
1993) for building decision trees represents a practical reasoning mechanism for inducing
domain knowledge from sample data.


Quinlan (1993) gives a detailed overview of C4.5's implementation. In this
section we illustrate results of using our DUALTREE procedure, given a decision tree
produced by C4.5 on a subset of the BEBR sample data of plans to purchase an
automobile. Our purpose here is to show how DUALTREE contributes to constructing
a smaller decision tree and not to discuss implications of the trees. For this illustration,
we want to start with a small tree in order to keep the discussion tractable. To keep the
initial tree small, we use a subset of the BEBR survey intentional-data for automobile
purchase plans, and perform the following two actions. First, we eliminate all of the
'NO' respondents. This action helps reduce the size of a tree because we have one less
category or class to describe (i.e., leaves have fewer possible values). Next,
we combine the quarterly data so that the time dimension is given as the 'first half' or
'second half' of each respective year. This helps to produce smaller trees since we are
decreasing the total number of test outcomes when testing this attribute while building the
tree (i.e., instead of sixteen consecutive quarters, or sixteen links to other nodes, we have
eight consecutive semi-annual periods). We use this subset of data for input to C4.5
where the class or category for each instance is simply the respondent's answer of 'YES',
'MAYBE', or 'DK' (i.e., Don't Know). The attributes consist of the five component
questions, whose values are either 'Better', 'Same', 'Worse', 'Don't Know' or 'Good',
'Uncertain', 'Bad', 'Don't Know'. Following are the BEBR survey questions and the
attribute assignment used for each component question of the composite index. Note that
they are given in the order in which they are read to the respondents.


CURFIN  ==>  Current-Personal  Expectations 

We  are  interested  in  how  people  are  getting  along  financially  these  days.  Would 
you  say  that  you  (and  your  family  living  there)  are  better  off  or  worse  off 
financially  than  you  were  a  year  ago? 

FUTFIN ==> Future-Personal Expectations

Now,  looking  ahead— do  you  think  that  a  year  from  now  you  (and  your  family 
living  there)  will  be  better  off  financially,  or  worse  off,  or  just  about  the  same  as 
now? 

USFUFI  ==>  U.S.  1-Year  Condition 

Now  turning  to  business  conditions  in  the  country  as  a  whole— do  you  think  that 
during  the  next  12  months  we'll  have  good  times  financially,  or  bad  times,  or 
what? 

USNEX5  ==>  U.S.  5-Years  Condition 

Looking  ahead,  which  would  you  say  is  more  likely— that  in  the  country  as  a 
whole  we'll  have  continuous  good  times  during  the  next  five  years  or  so,  or  that 
we  will  have  periods  of  widespread  unemployment  or  depression,  or  what? 

GBTIME  ==>  Household  Buying  Expectations 

About  the  big  things  people  buy  for  their  homes— such  as  furniture,  a  refrigerator, 
stove, television, and things like that. Generally speaking, do you think now is
a  good  or  a  bad  time  for  people  to  buy  major  household  items? 
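
The two preprocessing actions described before the question list (dropping the 'NO' respondents and merging quarters into half-years) can be sketched as follows. This is a minimal sketch, assuming each survey record is a dictionary with hypothetical field names `plan`, `year`, and `quarter`; the actual BEBR file layout is not given here, and the four survey years used in the check are placeholders.

```python
def half_year_label(year, quarter):
    """Map a (year, quarter) pair to a label such as "2nd '91"."""
    half = "1st" if quarter in (1, 2) else "2nd"
    return f"{half} '{year % 100:02d}"


def prepare(records):
    """Drop 'NO' respondents and replace year/quarter by a TIME attribute."""
    kept = []
    for record in records:
        if record["plan"] == "NO":
            continue  # one less class for the tree to describe
        row = dict(record)
        row["TIME"] = half_year_label(row.pop("year"), row.pop("quarter"))
        kept.append(row)
    return kept


# Hypothetical records for illustration only.
sample = [
    {"plan": "YES", "year": 1991, "quarter": 3, "CURFIN": "Same"},
    {"plan": "NO", "year": 1991, "quarter": 4, "CURFIN": "Worse"},
    {"plan": "DK", "year": 1992, "quarter": 1, "CURFIN": "Better"},
]
prepared = prepare(sample)

# Sixteen quarters over four (placeholder) years collapse to eight periods.
periods = {half_year_label(y, q) for y in range(1989, 1993) for q in range(1, 5)}
print(prepared)
print(len(periods))
```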

Table 1.1 shows the frequencies for the class labels. Considering Table 1.1, we
can say, for example, that a decision tree consisting of a single leaf labeled YES
misclassifies about thirty percent of the 3910 instances. A tree produced by the C4.5
program  for  our  subset  of  data  is  shown  in  Figure  1.14.  This  tree  has  a  total  of  53  nodes 
and  misclassifies  roughly  thirty  percent  of  the  sample. 
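
The thirty-percent figure for the single-leaf tree follows directly from the class frequencies in Table 1.1. A short Python check of the arithmetic (variable names are illustrative):

```python
# Class frequencies as reported in Table 1.1.
counts = {"Yes": 2755, "Maybe": 1084, "Don't Know": 71}

total = sum(counts.values())
percent = {label: round(100 * c / total, 1) for label, c in counts.items()}

# A single leaf labeled with the majority class misclassifies everything else.
majority = max(counts, key=counts.get)
baseline_error = 1 - counts[majority] / total

print(total)
print(percent)
print(majority, round(baseline_error, 3))
```

Any tree worth keeping should therefore be judged against this majority-class baseline, in size as well as in error rate.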

The  tree  in  Figure  1.14  shows  that  there  are  sixteen  different  descriptions,  or 
terminal  paths,  for  the  YES  class  and  eleven  descriptions  for  the  MAYBE  class.    The 


Table 1.1 Distribution of respondents' answers for buying a car

CLASS = Will Buy a Car?    Frequency    Percent
Yes                             2755       70.5
Maybe                           1084       27.7
Don't Know                        71        1.8


numbers shown next to the leaves indicate how many instances reached the
leaf / how many of those instances are misclassified by the leaf. C4.5 insists on having at
least two outcomes with a minimum number of cases for any test used in the tree. This
avoids the use of near-trivial tests which typically lead to odd trees with little predictive
power (Quinlan 1993). C4.5 uses various heuristics for assigning classes to terminal
nodes representing subsets that do not contain the minimum number of cases. Other
criteria also exist for deciding not to partition a subset any further. Typical ones are
based on assessing the split from the perspective of statistical significance, information
gain, or 'error reduction' (Breiman et al. 1984; Mingers 1989; Quinlan 1993).
Figure 1.14 shows that all of the leaves in the tree misclassify at least one of the


Figure 1.14 A decision tree for the BEBR data


instances.  Thus,  there  are  no  'perfect'  leaves  in  the  tree.  Quinlan  (1993)  suggests  that 
elements  of  'randomness'  are  introduced  in  a  method  that  chooses  a  particular  test  from 
several  equally  promising  ones,  when  the  tests  are  selected  based  on  examinations  of 
small  subsets  of  cases.   Thus,  our  attribute-tests  may  be  regarded  as  imperfect  since  the 


attributes do not capture all of the information relevant to classification. C4.5's stopping
criteria focus on having a significant number of cases at each leaf so that the tree
reveals the structure of a domain and has good 'predictive' power.

We now use DUALTREE to form new features and, for this illustration, we prefer
features that are conjuncts of at most three attributes, again to help keep the illustration
simple. For now, we are interested in results from using DUALTREE, and not the actual
features given by the procedure. Using our procedure, we identify sixteen new features
that are conjuncts of at most three attributes, for inclusion in this example. The sixteen
new features we add to the set of initial attributes are:

[(CURFIN=Same ?) & (FUTFIN=Don't Know ?)]
[(CURFIN=Same ?) & (USNEX5=Uncertain ?)]
[(FUTFIN=Better ?) & (USFUFI=Don't Know ?)]
[(FUTFIN=Better ?) & (USNEX5=Uncertain ?)]
[(FUTFIN=Don't Know ?) & (USFUFI=Uncertain ?)]
[(FUTFIN=Don't Know ?) & (TIME=2nd '91 ?)]
[(USNEX5=Better ?) & (GBTIME=Good ?)]
[(USNEX5=Uncertain ?) & (TIME=2nd '91 ?)]
[(USNEX5=Uncertain ?) & (TIME=1st '92 ?)]
[(CURFIN=Same ?) & (USNEX5=Uncertain ?) & (TIME=1st '92 ?)]
[(CURFIN=Same ?) & (FUTFIN=Don't Know ?) & (USFUFI=Uncertain ?)]
[(FUTFIN=Better ?) & (USNEX5=Bad ?) & (GBTIME=Good ?)]
[(FUTFIN=Same ?) & (GBTIME=Uncertain ?) & (TIME=1st '92 ?)]
[(FUTFIN=Don't Know ?) & (USFUFI=Uncertain ?) & (USNEX5=Uncertain ?)]
[(FUTFIN=Don't Know ?) & (USFUFI=Uncertain ?) & (TIME=1st '92 ?)]
[(FUTFIN=Don't Know ?) & (USNEX5=Uncertain ?) & (TIME=2nd '91 ?)]
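
Each feature in this list is a Boolean conjunction of attribute-value tests, so it can be evaluated on an instance and added as an extra attribute before the tree is rebuilt. The Python sketch below illustrates only this evaluation step; it is not the DUALTREE procedure itself, and the helper names and the sample respondent are hypothetical.

```python
def make_conjunct(*tests):
    """Build a Boolean feature that holds when every (attribute, value) test holds."""
    def feature(instance):
        return all(instance.get(attr) == value for attr, value in tests)
    # Label in the notation used for the feature list.
    feature.label = " & ".join(f"({attr}={value} ?)" for attr, value in tests)
    return feature


def add_features(instance, features):
    """Return a copy of the instance extended with one Boolean per feature."""
    extended = dict(instance)
    for feature in features:
        extended[feature.label] = feature(instance)
    return extended


# Three features taken from the list; the respondent is made up.
f1 = make_conjunct(("CURFIN", "Same"), ("FUTFIN", "Don't Know"))
f2 = make_conjunct(("CURFIN", "Same"), ("FUTFIN", "Don't Know"),
                   ("USFUFI", "Uncertain"))
f3 = make_conjunct(("USNEX5", "Uncertain"), ("TIME", "1st '92"))

respondent = {"CURFIN": "Same", "FUTFIN": "Don't Know", "USFUFI": "Uncertain",
              "USNEX5": "Good", "GBTIME": "Good", "TIME": "1st '92"}
augmented = add_features(respondent, [f1, f2, f3])
print(augmented)
```

A tree builder can then treat each added Boolean like any other two-valued attribute, which is how a single test on a conjunct can stand in for a path of several primitive tests.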


We  show  a  decision  tree  in  Figure  1.15  resulting  from  using  the  new  features  along  with 
the  initial  attributes.  The  tree  in  Figure  1.15  still  misclassifies  about  thirty  percent  of  the 


Figure 1.15 A decision tree using DUALTREE features


instances but has only thirteen nodes, a reduction of forty nodes from the initial tree. The
figure also shows that the tree contains three of the new features. Thus, in this example,
we have shown that including sixteen additional features in the set of features for building
a decision tree reduces the size of the tree by a factor of four, without significantly


increasing the misclassification rate of the tree. Also, the tree in Figure 1.15 has a rank
equal to one. Decision trees having a rank of one are generally easier to comprehend
since it is easier to keep track of the outcomes of tests at the antecedent nodes. This
completes our illustration of how DUALTREE aids in building smaller decision trees.

1.5   Thesis  and  Objectives 

The  general  goal  of  this  research  is  to  use  AI  methods  and  techniques  for 
analyzing  business-survey-data.   We  state  the  main  thesis  as  follows: 

PRIMARY  THESIS:  Empirical  models  of  learning  based  on  Artificial 
Intelligence  methods  and  techniques  represent  systems  that  provide  a 
useful  approach  for  examining  certain  research  questions  related  to  the 
1990-1991   recession,   using  data  collected  from  business  surveys  of 
consumer  attitudes  and  expectations.    Models  such  as  these  offer  a  new 
tool for processing the information contained in business surveys.
In  support  of  this  thesis  I  direct  my  research  around  three  activities:  (1)  defining  the  time- 
complexity  problem  of  feature  construction,  (2)  developing  a  procedure  to  help  solve  the 
problem,  and,  (3)  designing,  implementing,  and  testing  the  procedure  and  methods  using 
a  given  sample  of  data.     The  subtheses  of  these  three  activities  and  their  specific 
objectives  are  as  follows: 


1.5.1    Problem  Definition 


Subthesis: We can build a decision tree within the standard time complexity, by
adding at most j new and useful features to the existing set of features. Useful features
are features that are likely to be selected by the heuristics used to construct the decision
tree.

Objectives: 

-Define  'feature  construction'  and  develop  an  analytical  model  showing  how  the 
temporal  behavior  of  the  algorithm  changes  as  we  add  additional  features. 

-Identify  the  difficulties  and  conditions  for  improving  the  computational 
efficiency  of  feature  construction. 

-Establish  general  conditions  that  must  be  satisfied  by  any  approach  for 
resolving  the  problem. 

1.5.2    Problem  Resolution 

Subthesis: The DUALTREE procedure for forming useful features produces feature-sets
having practical sizes. We use "useful" in the sense that the features it creates are
likely to be used by tree-construction heuristics (i.e., given two features having high
information gain for a subset of the instances, the gain-ratio criterion selects the one
giving the higher proportion of split-information). A feature-set has a practical size if,
for example, it does not contain 2^n features when using a set of n primitive attributes.
Objectives: 

-Identify  and  examine  suitable  methods  for  forming  features. 




-Use  the  conditions  given  by  the  problem  definition  to  explore  new 
approaches  to  feature  construction  in  a  computationally  efficient  way. 

-Identify a procedure for constructing 'useful' features such that the time
required to construct a new decision tree using these features and features
from a given decision tree is of a lower order than the time used to
produce the given tree.


1.5.3    Implementation  and  Experimentation 

Subthesis:  An  empirical  learning  model  using  Decision  Trees  and  DUALTREE  feature 
construction,  provides  'useful'  descriptions  from  the  BEBR  business  surveys,  allowing  us 
to  test  the  following  hypotheses: 

(1) Given demographic descriptions for the four categories of responses
representing the respondents' future financial condition, descriptions for the 'don't
know' category are unique, in that they are unlike any of the remaining three, to
a certain extent, and should not be combined with one of the remaining three to
examine certain changes taking place during the 1990-1991 recession.

(2)  Changes  taking  place  in  the  'don't  know'  category  of  consumer 
expectations  for  personal  finances,  buying  plans,  and  the  national 
economy,  best  explain  the  change  in  consumers'  unwillingness  to  purchase 
cars.  Since  this  category  is  not  considered  in  most  methods  for  computing 
a  consumer-confidence-metric,  this  may  explain  why  consumer  spending 
was  over-predicted  for  motor  vehicle  purchases. 

Objectives: 

-Encode  an  algorithm  using  C++  for  DUALTREE. 

-Implement  and  empirically  test  two  hypotheses  using  the  BEBR  sample 
data. 

-Analyze  the  results  quantitatively  and  determine  the  relative  worth  of  the 
proposed  methods. 


1.6   Dissertation  Outline 


The  following  chapters  describe  the  results  of  my  research.  Chapter  2  presents  a 
conceptual  model  of  consumer-household-consumption  and  its  relationship  to  consumer 
attitudes and well-being. Chapter 3 introduces basic definitions for machine learning and
decision trees, and reviews the operation of the C4.5 program. In Chapter 4 we present our
analytical  model  of  the  time  complexity  of  building  decision  trees  using  feature 
construction.  We  describe  our  DUALTREE  algorithm  for  forming  features  in  Chapter  5, 
and  discuss  experimental  results  and  conclusions  in  Chapters  6  and  7  respectively. 


CHAPTER  2 
DESCRIBING  CONSUMER  HOUSEHOLDS 


An  objective  of  this  dissertation  is  to  construct  a  knowledge  base  of  information 
related  to  consumer  spending  using  AI  concepts  and  techniques.  Economists  and  business 
analysts  may  use  this  information  to  better  understand  decision  criteria  used  by  a  diverse 
group  of  consumers.  Pau  et  al.  (1989)  describe  how  the  exposure  of  economists,  banks, 
and  management  departments  to  AI  through  knowledge  based  systems,  natural  language 
analysis,  or  symbolic  programming  environments,  has  increased  since  the  mid  1980's. 
Their  work  supplies  a  structured  collection  of  known  projects,  organizations  involved,  and 
tools/methods  used,  in  a  variety  of  applications  of  AI  in  Economics  and  Management. 
Additionally,  the  authors  list  several  areas  posing  unresolved  challenges  to  AI,  such  as 
policy analysis, public services, and forecasting. Economic forecasting, for example, is
an  activity  resulting  in  a  set  of  predictions  produced  by  a  forecaster  or  forecasting  method 
pertaining  to  estimations  of  present  or  future  demand  along  with  present  or  future  need. 
This  activity  also  requires  a  model  of  decision-making  by  consumer-households  with 
respect  to  the  choice  of  goods  and  services  used  in  living,  along  with  other  relationships 
and  activities  stemming  from  their  choices  (Cochrane  and  Bell  1956). 

This dissertation focuses on using machine-learning methods to obtain meaningful
descriptions of consumer-households consuming different amounts of durable goods and
services such as major household appliances, houses and automobiles, or having different
levels of discretionary and postponable expenditures. We use the terms family,
household, and consumer interchangeably to refer to the concept of a consumer
household. A consumer household is a financially independent entity in which one or
more people live together and pool their income to make joint expenditure decisions.
Financial independence is determined by the three major expense categories of housing,
food, and other living expenses (U.S. Bureau of Labor Statistics 1989). Katona and
Mueller (1956) suggest that short-term changes in people's appraisals of trends in their
economic welfare can be attributed in large part to variations in business conditions as
they are perceived by and affect the individual household. Also, economists regard
changes in consumer preferences, attitudes and expectations as a type of scientific data
which is as reliable as changes in income, price, and the like (Katona and Mueller 1956;
Juster 1959; Katona 1960; Juster 1964). Using survey data of consumer attitudes and
expectations, we seek useful descriptions of consumer households having different
appraisals of trends in their financial situation, such as 'better off', 'the same', 'worse
off', and 'don't know'. Additionally, we prefer descriptions in terms of household or
consumer characteristics such as race, income, party affiliation, age, sex, occupation, and
the like. The following sections discuss (1) the flow of durable market goods and
services within the household, and (2) approaches for using survey data to aid in
determining a demand function for durable goods.



2.1    Consumer  Consumption  of  Durable  Goods 

Katona  (1960)  applies  two  key  propositions  to  the  study  of  consumer  motives, 
habits,  attitudes  and  expectations  on  consumer  spending.  They  are  that  demand  depends 
on  income  and  confidence,  and  that  changes  in  confidence  are  measurable.  Examining 
questions  concerning  the  subjective  saliency  of  consumer  needs  and  the  transformation 
of  these  needs  into  demand,  the  author  reports  that  consumers'  discretionary  expenditures 
are  a  function  of  several  consumer  attitudes  such  as  one's  view  of  their  personal  financial 
situation,  what  happens  to  other  members  of  the  household  and  community,  and  what 
happens  to  the  country.  A  major  thesis  of  Katona  and  Mueller  (1956)  was  that  consumer 
demand,  especially  for  durable  goods,  is  a  function  of  both  ability  to  buy  as  measured  by 
data  on  income,  assets,  debts  and  the  like,  and,  willingness  to  buy  as  measured  by 
attitudinal  and  expectational  questions  in  surveys. 

This dissertation examines consumer motives, attitudes and expectations, and their
relationships to consumer spending. We use AI concepts and techniques to develop an
empirical model for investigating responses to the attitudinal and expectational questions
found in business surveys. Advantages of using a knowledge-based approach over the
commonly used traditional statistical techniques include: (1) AI provides a framework for
working with a large number of attributes which may have very complex relationships,
(2) AI methods provide a framework for dealing with both quantitatively and qualitatively
scaled attributes, and (3) AI provides a framework allowing for the explicit description
of the economic agent's process and the economists' behavior in terms of a given set of
attributes, for example, by developing an AI-based system that integrates an existing
econometric model with an 'expert' knowledge base to explain and compute many of the
factors which are outside the scope of the model. The next sections describe an approach
we use to study household consumption of commodities, incorporating business surveys
of consumer preferences, attitudes and expectations. Following this, we discuss the
survey data and the research questions we investigate in this dissertation.

2.1.1    Estimating  a  Demand  for  Commodities 

Economists  and  business  analysts  labor  to  determine  a  demand  function  for 
durable  goods  and  services,  with  respect  to  the  household-sector—business-sector 
relationship.  We  assume  that  human  resources,  income,  wealth  and  price  all  constrain 
consumption.  Also,  other  characteristics  of  income,  as  well  as  the  amount  of  income, 
influence  household  consumption  choices— namely  (1)  regularity  and  certainty  of  income 
may  affect  the  proportion  of  income  used  for  current  consumption,  (2)  expectations 
regarding  future  income  may  affect  the  savings  rate  and  willingness  to  pay  for  current 
consumption  with  credit,  and,  (3)  sources  of  income  and  the  number  of  earners  may  affect 
decisions  about  income  use  (Cochrane  and  Bell  1956;  Juster  et  al.  1981;  Magrabi  et  al. 
1991).  Considering  consumer  expectations  regarding  their  future  financial  status,  we  want 
to  examine  how  these  expectations  changed  before,  during,  and  after  the  1990-1991 
recession.  To  do  this,  we  require  'representative'  descriptions  of  consumer-households 
for  different  comparative  financial  states  such  as  better  off,  the  same,  worse  off,  and  don't 
know, in terms of a fixed set of consumer-characteristics or attributes, such as age,
income,  and  party  affiliation.    Our  empirical  analysis  requires  a  model  capable  of  (1) 


handling a large number of attributes, and (2) providing results that are easily
interpretable by humans.

Magrabi  et  al.  (1991)  review  several  theoretical  approaches  used  to  study 
household  consumption  of  commodities.  These  include  utility  functions,  consumption  and 
savings  functions,  household  production  theory,  life-style  and  life-quality  approaches  and 
others.  Their  results  reveal  no  single  logically  coherent  theory  adequate  for  the  analysis 
of  all  aspects  of  household  consumption  behavior.  The  authors  highlight  the  strengths  of 
many  existing  theories  and  conceptual  constructs,  and  also  describe  research  models  that 
use  combined  concepts  of  two  or  more  theoretical  approaches.  The  literature  offers 
several  key  results  associated  with  research  models  based  on  many  of  these  concepts.  For 
example,  according  to  Suranyi-Unger  (1977),  the  majority  of  Americans  adhere,  to  a 
greater  or  lesser  extent,  to  some  institutionalized  common  life-style.  He  suggests  that 
such  life-style  groups  or  'standard  classes'  may  be  identified  either  with  respect  to  the 
similarity  of  their  behavioral  patterns  (e.g.,  spending  patterns),  or  with  respect  to  their 
demographic  characteristics.  Mitchell  (1983)  offers  a  comprehensive  classification  of  life- 
style types,  based  in  part  on  developmental  psychology  drawing  on  Maslow's  hierarchy 
of  needs.  A  unique  way  of  life  represented  by  each  type  is  described  in  terms  of 
demographics,  attitudes,  financial  status,  and  use  or  ownership  of  selected  consumer 
goods.  Our  aim  here  is  to  show  the  significance  of  demographic  data  in  models  based 
on  many  of  the  theories.  Indeed,  Ketkar  and  Cho  (1982)  show  that  various  demographic 
factors  such  as  the  age  of  the  household  head,  his/her  educational  attainment,  the 


employment status of the household head and spouse, the household's race and region of
location, all determine expenditure patterns in the United States.

Recall  that  economic  forecasting  is  an  activity  resulting  in  a  set  of  predictions  of 
present or future demand along with present or future need. Economists and business
analysts use various econometric models to develop forecasts. Business surveys are
commonly incorporated in their forecasting models that give an averaged view of the
economy (Katona and Mueller 1956; Juster 1966; Zarnowitz 1967; Gianotti 1989; Palies
and  Philip  1989).  Business  surveys  consist  of  a  relatively  systematic  standardized 
approach  to  collecting  the  information  for  each  category  of  answers  to  a  set  of  questions. 
Knowledge  about  how  to  properly  construct  and  administer  the  survey  instrument  is  found 
in  Rossi  et  al.  (1983)  and  will  not  be  discussed  here. 

A  general  rule  for  interpreting  surveys  is  that  an  answer  is  a  function  of  the 
question  (Katona  and  Mueller  1956).  When  the  set  of  questions  is  designed  to  elicit 
consumer  plans  or  intentions  to  buy  certain  goods,  Juster  (1966)  interprets  these  plans  or 
intentions  to  buy  as  reflecting  the  respondent's  estimate  of  the  probability  that  the  item 
will  be  purchased  within  the  specified  time  period.  The  survey  instrument  used  in  this 
research is designed to examine the interrelationships of consumer purchases and buying
intentions to consumer attitudes and expectations. The following sections describe the
demographic data used in our work, along with the attitudinal surveys containing this
information. 


2.2   Survey  Data  of  Consumer  Households 

Fluctuations in the ratio of consumer purchases of durable goods to disposable
income are of key interest to economists concerned with upswings and downswings in
business activity (Juster 1959). The earliest surveys eliciting anticipatory data for
predicting the demand for durable goods were begun in 1945 by the Survey Research Center
(SRC) at the University of Michigan (Juster 1966). Generally, economists use survey
data  to  create  an  index  of  consumer  attitudes  associated  with  subsequent  changes  in 
purchases  of  durable  goods  (Juster  1959).  This  dissertation  uses  survey  data  to  study 
certain  psychological  factors  of  consumers  as  a  function  of  their  responses  found  in 
business  surveys.  This  data  represents,  among  other  things,  the  fixed  characteristics  or 
factual  information  about  consumers  such  as  age,  education,  income,  and  party  affiliation. 

Pioneering  research  on  using  survey  data  of  consumer  attitudes  was  performed  at 
the University of Michigan's SRC by Katona and Mueller (1956). Noting that consumer
demand  depends  on  both  ability  to  buy  and  willingness  to  buy,  they  claimed  that  changes 
in  consumer  optimism  strongly  influence  the  rate  of  consumer  spending  on  discretionary 
or  postponable  items.  Katona  and  Mueller  (1956)  did  not,  however,  simply  ask  people 
whether  they  planned  to  buy  automobiles  or  major  household  appliances.  Instead  they 
created  a  composite  index  of  consumer  sentiment  using  their  surveys,  incorporating 
several  indicators  measuring  changes  in  consumer  buying  intentions,  attitudes  and 
expectations. Two key reasons for using several components to determine consumer
sentiment are (1) buying inclinations may depend on a variety of attitudes, and (2)
answers to single questions are unreliable, depending upon personal circumstances, the
mood of respondents and question wording. Katona and Mueller (1956) also state:

The  need  for  preparing  a  summary  measure  of  changes  in  consumer  attitudes 
became  particularly  clear  when  recently  calculations  were  published  which 
compared  changes  in  the  answers  to  single  attitudinal  questions  with  aggregate 
durable  goods  sales.  This  procedure  assumes  that  each  individual  attitude,  taken 
in  isolation,  must  have  a  specific  relation,  unchanged  over  time,  to  consumer 
behavior.  The  unitary  nature  of  psychological  wholes  composed  of  divergent 
parts,  as  well  as  the  multiplicity  of  human  motivations  (some  motives  reinforcing 
one  another  and  others  conflicting  with  one  another)  are  disregarded.  As  Gestalt 
theory  has  shown,  a  part  or  item  may  change  its  meaning  and  function  according 
to  the  whole  to  which  it  belongs.    (92) 

Another significant contributor is F. Thomas Juster (1959). Juster's early research
showed that (1) expectational and financial variables are more closely associated with
short-horizon and definite buying plans than with longer-horizon and indefinite ones, and
(2) it is difficult to determine which expectational and financial factors are most closely
associated with buying plans and purchases because of strong interrelationships among
the variables. Another key research result given by Juster (1966) confirmed the finding
by Katona and Mueller (1956) that surveys of consumer intentions to buy are inefficient
predictors of purchase rates because they do not provide accurate estimates of mean
purchase probability. His experiments verified the hypothesis that the basic predictors of
purchase rates given by an intentions survey, namely the proportions of intenders
(respondents who answer 'yes') and nonintenders (respondents who answer 'no') in the
sample, are inefficient predictors because the mean purchase probabilities of intenders and
nonintenders vary over time. These results led him to develop purchase probability
surveys, which are still used to create indexes of consumer confidence.
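
The argument above can be made concrete with a small calculation. The sketch below assumes that the aggregate purchase rate is the intender share times the intenders' mean purchase probability, plus the nonintender share times the nonintenders' mean purchase probability. The function name and all numbers are hypothetical and are not taken from Juster's data; they only show that a constant share of intenders is compatible with different purchase rates.

```python
def purchase_rate(share_intenders, prob_intenders, prob_nonintenders):
    """Aggregate purchase rate x = p*r + (1 - p)*s (illustrative notation)."""
    return (share_intenders * prob_intenders
            + (1.0 - share_intenders) * prob_nonintenders)

# Hypothetical values: the share of intenders is 20% in both periods,
# but the mean purchase probabilities of the two groups shift.
period_a = purchase_rate(0.20, 0.50, 0.05)
period_b = purchase_rate(0.20, 0.40, 0.10)

print(round(period_a, 3), round(period_b, 3))
```

Because the two periods share the same proportion of intenders yet yield different rates, the proportion alone cannot track purchases when the group probabilities move.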


Today, several indices of consumer confidence regularly appear in a variety of
business publications. These include the Consumer Confidence Index (CCI) published
by the Bureau of Economic and Business Research (BEBR) at the University of Florida,
and the Index of Consumer Sentiment (ICS) and Consumer Expectations at the
University of Michigan. The respective indexes of consumer confidence usually appear
monthly, and these two use identical components for creating the index. We obtained the
data used in this research from the BEBR survey of consumer confidence.

Eichhorn  et  al.  (1978)  define  an  economic  index  as: 

DEFINITION [Economic Index, (Eichhorn et al. 1978)]

An economic index is an economic measure, i.e., a function

    $F : D \rightarrow \Re^{[0,1]}$

which maps, on the one hand, a set D of economically interesting objects into the set $\Re$
of real numbers and which satisfies, on the other hand a system of economically
relevant conditions (for instance, monotonicity and homogeneity or homotheticity
conditions). The form of these conditions depends on the economic information which
we want to obtain from the particular measure. (3)

For the previous definition, [0,1] represents the set of real numbers between zero and one
inclusively. The next section describes the BEBR survey and the various components
used to construct the BEBR's composite confidence-index. We then describe the
consumer-household characteristics or attributes used in our work.

2.2.1    BEBR  Survey  of  Consumer  Confidence 

The BEBR is an applied research center in the College of Business Administration
at the University of Florida. Founded in 1929, its primary role is to conduct applied
research focusing on the State of Florida. The BEBR Survey Program, starting in 1983,
administers a monthly sample survey of 500-600 households in Florida. The sample of
households is generated through random-digit dialing of telephone numbers throughout
the state. The numbers are called to identify a household and an adult (18 or older)
respondent. The survey is designed to collect data on consumer attitudes about various
business and economic conditions, and on the demographic and socioeconomic
characteristics of Floridians. The survey and the procedure for constructing the index are
patterned after the University of Michigan's national index.

Figure 2.1 shows the trend of both the BEBR's CCI and the University of
Michigan's ICS for the time period January 1992 - May 1993. The graph shows that the
indices track fairly well together; thus, we may infer that the confidence of Floridians is
a reasonable proxy of consumer confidence across the nation. Note that this supports our
previous conclusion regarding similar data for the time period January 1989 - December
1992 (see Chapter 1). Next, we discuss the BEBR-survey questions used for eliciting
consumer preferences, attitudes and expectations.

2.2.1.1    BEBR index components

Five  components  are  used  in  calculating  the  BEBR's  CCI.  Three  of  these  are 
indicators  of  consumer  expectations  for  personal  finances  and  buying  plans,  and  the  other 
two  are  indicators  of  consumer  expectations  of  the  national  economy.  Appendix  A  shows 
the  survey  questions,  variable  assignments,  and  range  of  responses  for  each  component 
of  the  composite  index  as  well  as  for  the  demographic  information.  The  index- 
component-questions  are  given  in  the  order  in  which  they  are  read  to  the  respondents. 
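
The BEBR index is patterned after the University of Michigan's index, whose published procedure scores each component as the percentage of favorable replies minus the percentage of unfavorable replies, plus 100. The sketch below illustrates that scoring under the assumption that the BEBR components are scored the same way. The function names, the sample percentages, and the base-period normalization are hypothetical and are not taken from the BEBR documentation.

```python
def relative_score(pct_favorable, pct_unfavorable):
    """Component score: percent favorable minus percent unfavorable, plus 100."""
    return pct_favorable - pct_unfavorable + 100.0

def composite_index(scores, base_scores):
    """Hypothetical normalization: current total relative to a base-period total."""
    return 100.0 * sum(scores) / sum(base_scores)

# Hypothetical (favorable %, unfavorable %) pairs for five components.
replies = [(40.0, 15.0), (35.0, 20.0), (30.0, 25.0), (45.0, 10.0), (50.0, 20.0)]
scores = [relative_score(fav, unfav) for fav, unfav in replies]

# Hypothetical base-period component scores.
base_scores = [130.0] * 5

print(scores)
print(round(composite_index(scores, base_scores), 1))
```

A component with equal shares of favorable and unfavorable replies scores 100, so values above 100 indicate net optimism on that question.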


[Figure 2.1: line plot titled "Indices of Consumer Confidence", showing the BEBR and
Michigan indices by month for Jan '92 - May '93; vertical axis from 60 to 105]

Figure 2.1    CCI's for Jan '92 - May '93


2.3   Describing  Purchasers  of  Durable  Goods 


A primary purpose of this research is to develop and evaluate a tool for examining
data found in business surveys. Using the survey data, we seek an empirical model that
gives useful descriptions of consumers having higher inclinations to purchase durable
goods, assuming that consumers in the 'better' category have higher levels of
discretionary and postponable consumer expenditures. Additionally, we want to explore
how major motivational forces (i.e., age, income, party affiliation) influence consumer
purchases of durable goods during the short-term horizon when unforeseen or imperfectly
foreseen events occur.

As an illustration of the way the BEBR survey data is used, in the BEBR Florida
Consumer Confidence Index press release dated June 3, 1992, Dave Denslow, a University
of Florida Research Economist, reports on various results associated with the survey data.
He states:

The  stronger  demand  for  housing  stems  from  rising  employment  and  falling 
mortgage  rates.  Along  with  it  comes  greater  optimism  about  near-term  prospects 
for  the  national  economy.  The  share  of  respondents  expecting  the  national 
economy  to  revive  during  the  coming  year  rose  to  41  percent  in  May,  up  from  36 
in  April.    (1) 

This example illustrates a case where we see a connection made between 'home-buying
plans' and consumer expectations for the national economy. As another example,
reporting on a 'sagging' confidence index, in the press release dated August 3, 1992,
Economist Denslow states:

By  itself  the  July  change  is  trivial.  More  troubling  is  the  way  confidence  has 
stalled.  Our  index  rebounded  from  the  60s  in  January,  but  except  for  the  81 
registered  in  April  it  has  been  stuck  in  the  70s  ever  since.  Only  if  it  climbs  well 
into  the  80s  can  we  expect  consumer  spending  to  surge.    (1) 

For this case, a connection is made between 'climbing consumer confidence' and a
'surge' in consumer spending.

To illustrate how the data may be used in other ways, let's consider people's
replies to the question, "Now, looking ahead--do you think that a year from now you (and
your family living there) will be better off financially, or worse off, or just about the
same as now?" Note that Figure 2.1 shows a fairly stable confidence index from
January through September of 1992. Also during this time, consumers witnessed a
considerable amount of intense debate regarding the candidates for the 1992 Presidential
Elections. Figure 2.2 shows consumer confidence in their future financial situation, along
with answer percentages for the respondents' replies. We see that, for the time period,
consumers' level of confidence for their future-financial-state remained fairly stable. Note
that for Figure 2.2 and the following graphics, FUT-CONF represents the confidence level
or value for the 'future-financial-component' of the BEBR composite index.

[Figure 2.2: plot titled "Financial Expectations", Jan '92 - Sep '92; vertical axis:
percent (0 to 120); horizontal axis: months (JAN through SEP); series: FUT-CONF,
BETTER, SAME, WORSE. FUT-CONF = BEBR future-financial-component confidence-value]

Figure 2.2   Consumers' financial confidence during 1992


Table 2.1    Income and age distributions of financial confidence

INCOME         FUT-CONF    %BETTER    %SAME    %WORSE    # obs
<$25K              91.2      35.39    49.72     14.89     1424
$25K-$45K          98.7      42.12    46.45     11.43     1339
$45K-$75K         102.1      45.55    44.17     10.28      652
>$75K             111.3      54.87    37.99      7.14      308

AGE            FUT-CONF    %BETTER    %SAME    %WORSE    # obs
18 - 44 yrs       108.3      51.95    39.67      8.38     2566
45 - 65            85.6      30.5     52.01     17.5      1046
> 65 yrs           73.7      15.25    66.34     18.41      918

INCOME         18 - 44         45 - 65         > 65 yrs
<$25K          104.5 (843)     79.3 (274)      70.9 (359)
$25K-$45K      109.3 (843)     87.4 (300)      70.7 (196)
$45K-$75K      108.9 (418)     90.1 (179)      70.7 (89.5)
>$75K          119.3 (188)     98.6 (95)       99.7 (25)

LEGEND:  FUT-CONF (# obs)


Table 2.1 shows consumer confidence for their future financial situation, for this
time period, distributed by income and age groups of consumers. The table shows a
confidence level, based on our question, and the distribution of respondents' replies.
Considering the income categories shown in the table, we see that, for Jan '92 through
Sep '92, in general, consumers with higher income levels reported higher levels of
confidence. For the three age groups shown in the table, we observe that older consumers
reported lower levels of confidence. In the lower part of Table 2.1 we see
future financial confidence levels for the survey data grouped by income and age. An
interesting observation is that consumers between the ages of 18 and 44 reported a fairly
high level of confidence regarding their future financial condition, regardless of their
income level!

As another example, Table 2.2 shows distributions of financial confidence using
the sex and party affiliation of consumers. From the table, essentially, male consumers
reported higher levels of future financial confidence than female consumers. Also, of the
three party affiliations of consumers, Republicans reported the highest levels of
confidence, and Democrats reported the lowest confidence levels. Note from the
cross-tabulation in the lower half of Table 2.2 that consumers who were members of the
Democratic party reported essentially the same level of future-financial-confidence,
regardless of their sex!

Recall  from  Chapter  1  that  we  proposed  two  hypotheses  regarding  the  BEBR 
business  surveys  and  'useful'  descriptions  of  consumer-households.  For  the  first 
hypothesis,  we  need  consumer-household  demographic  descriptions  for  the  four  categories 
of  responses  representing  the  respondents'  future  financial  condition,  in  order  to  see  if 
descriptions  for  the  'don't  know'  category  are  unique,  during  the  recent  recession.  For 
the  second,  we  need  descriptions  of  consumer  purchase  plans  for  automobiles,  in  terms 
of  their  attitudes  and  expectations  (i.e.,  CURFIN,  FUTFIN,  USFUFI,  USNEX5,  and 
GBTIME),  keeping  the  four  categories  of  answers  mutually  exclusive.  Given  that  we 
have  obtained  such  descriptions,  the  next  section  describes  an  approach  for  testing  our 
hypotheses. 


Table 2.2    Sex and party distributions of confidence

SEX            FUT-CONF    %BETTER    %SAME    %WORSE    # obs
MALE               99.0      43.74    43.54     12.73     1980
FEMALE             93.7      36.31    51.33     12.35     2550

PARTY          FUT-CONF    %BETTER    %SAME    %WORSE    # obs
Republican        105.0      47.75    43.67      8.58     1399
Democratic         89.9      33.41    51.96     14.63     1326
Independent        94.7      39.15    47.00     13.84     1134

SEX            Republican      Democratic      Independent
MALE           109.8 (661)     90.0 (504)      97.1 (543)
FEMALE         100.7 (738)     89.9 (822)      92.5 (591)

LEGEND:  FUT-CONF (# obs)


2.3.1    Experimental  Design  using  the  BEBR  Business  Surveys 


We  used  eleven  fixed  attributes  of  consumer-households  in  our  attempt  to  establish 
a  relationship  for  consumers'  perceived  future  financial  conditions.  The  consumer 
information  captured  by  our  attributes  includes:  (1)  age,  (2)  level  of  education,  (3) 
employment  status,  (4)  number  of  people  living  in  the  household,  (5)  family  annual 
income,  (6)  marital  status,  (7)  household  residence  in  a  metropolitan  statistical  area,  (8) 
job  category,  (9)  political  party  affiliation,  (10)  racial  background,  and,  (11)  sex. 


These  demographic  attributes  are  used  in  many  of  the  econometric  models  found  in  the 
literature.  Appendix  A  lists  the  alternative  values  for  the  demographic  data  as  well  as  the 
values  for  the  'time'  attribute.  For  the  'TIME'  dimension,  we  examine  the  survey 
responses  for  one  quarter  preceding  the  recessionary  episode,  the  quarters  of  the  recession, 
and  one  quarter  following  it  (i.e.,  the  recovery  period).  This  time  span  represents  an 
event  where  the  CCI  swung  from  a  relatively  high  level  to  a  relatively  low  level,  and 
back  again  to  a  relatively  high  level.  Also,  the  'regional'  information  for  a  consumer- 
household  is  captured  using  the  'MSA'  attribute.  The  'MSA'  value  reflects  whether  the 
consumer resides in a Metropolitan Statistical Area. For the BEBR data, these are
geographic units for economic analysis located in the state of Florida; however, the
concept of metropolitan areas has a national interpretation.

We  want  to  test  one  dependent  variable  for  each  of  the  hypotheses  given  in 
Chapter  1  regarding  the  BEBR  data.  For  the  first  hypothesis,  the  dependent  variable  is 
given  by  the  respondent's  answer  to  the  future-financial-condition  question  on  the  survey 
(i.e., FUTFIN). Keeping the categories of answers mutually exclusive (i.e., 'better',
'same', 'worse', and 'don't know'), we test to see if descriptions for the 'don't know'
category are unique. Note that in the previous illustrations regarding the BEBR survey
data, the 'don't know' answers were combined with answers of the 'same' when we
formed the future-financial-component confidence-value (FUT-CONF).
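
The difference between the two treatments of 'don't know' replies can be sketched as follows. The function name and the sample replies are hypothetical; the sketch only tallies answer shares, once with the four categories kept mutually exclusive and once with 'don't know' folded into 'same'.

```python
from collections import Counter

def answer_shares(answers, merge_dont_know=False):
    """Percentage of replies per category; optionally fold 'don't know' into 'same'."""
    if merge_dont_know:
        answers = ["same" if a == "don't know" else a for a in answers]
    counts = Counter(answers)
    total = len(answers)
    return {category: 100.0 * n / total for category, n in counts.items()}

# Hypothetical replies to the future-financial-condition question (FUTFIN).
replies = ["better", "same", "worse", "don't know",
           "better", "same", "same", "don't know"]

four_way = answer_shares(replies)                        # categories kept separate
three_way = answer_shares(replies, merge_dont_know=True)  # 'don't know' merged

print(four_way)
print(three_way)
```

Keeping the fourth category separate preserves the information needed to ask whether 'don't know' respondents have their own demographic description.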

The dependent variable for the second hypothesis is given by the respondent's
answer to the survey question asking whether anyone in the household plans to buy a car
or truck. The independent variables for this case are: CURFIN, FUTFIN, USFUFI,
USNEX5, and GBTIME. These attributes represent the attitudes and expectations of
consumers. Given descriptions of consumers' intentions to purchase automobiles over the
recessionary and recovery periods, we want to study these descriptions to see if they
represent plausible ones.

The following chapters describe an empirical tool we developed that produces the
type of consumer-household descriptions we desire. We discuss the experimental results
given by our model using the BEBR data in the chapter describing our experiments. In
the final chapter, we discuss the significance of our results.


CHAPTER  3 
MACHINE  LEARNING 


3.1    Background  in  Artificial  Intelligence 

The field of Artificial Intelligence (AI) focuses on designing or describing systems
that perform activities normally associated with humans. Thus, the fundamental
interests of AI research include modeling activities such as understanding natural (i.e.,
human) languages, problem solving, and more.

One way to describe Artificial Intelligence is by using references to its active areas
of research and to the many applications developed in the field. The AI applications
developed to date help subdivide the field into the following disciplines:

-Languages and Environments for AI
-Natural Language Understanding and Semantic Modeling
-Modeling Human Performance
-Automated Reasoning and Theorem Proving
-Game Playing
-Planning and Robotics
-Pattern Recognition

Reasoning systems are the focus of this dissertation.

3.2    Reasoning  Systems 

Initial work in AI reasoning systems during the 1950's and 1960's was largely
unsuccessful and too ambitious for the computing models and equipment of that era. As
a result, researchers refocused their efforts and concentrated on search methods and
knowledge representation.

Researchers  achieved  many  advances  in  search  and  knowledge  representation  in 
the  1970's.  However,  the  application  areas  (such  as  medicine,  chemistry,  mathematics 
and  the  like)  were  still  too  ambitious. 

A  leading  researcher,  Edward  Feigenbaum,  suggested  that  developers  limit 
reasoning  systems  to  areas  where  they  can  meaningfully  capture  and  apply  human 
expertise.  This  suggestion  resulted  in  the  birth  of  Expert  Systems.  Expert-systems 
development is currently a very active area yielding significant returns for using
human-expert knowledge in the form of computer code. In the mid-1980's, many Expert System
(ES) Shells became commercially available. An ES shell contains software to (1)
maintain a Knowledge-base (KB), (2) reason with the knowledge to solve a problem, and
(3) communicate with the user. It does not contain any specific knowledge about a
domain of interest, hence the term 'shell'. One must obtain the knowledge and load it
into  an  expert  system  shell.  The  process  of  knowledge  acquisition  has  undergone 
extensive  research. 

The  most  common  methods  of  knowledge  acquisition  involve  interviewing 
methods.  A  Knowledge  Engineer  (KE)  interviews  the  Domain  Experts  (DE).  The  KE 
then  translates  the  information  into  the  form  of  knowledge  representation  needed  by  the 
ES  shell. 

Knowledge-Acquisition is a time-consuming process that has bedeviled many
attempts at fielding ES applications. Feigenbaum (1983) states that the
"knowledge-engineering-bottleneck" severely limits the practical development of
knowledge-based systems. He associates the 'bottleneck' with using both domain experts
and computer engineers to build expert systems. This practice is both costly and difficult.
Many AI practitioners are now dissolving the bottleneck by developing programs which
begin with a minimal amount of information, if any, and 'learn' on their own.

This research focuses on Knowledge-Acquisition techniques using Decision Trees
and Decision Lists. The remaining sections of this chapter describe Machine Learning,
Decision Trees, and Decision Lists.

3.3    Machine  Learning 

There  are  two  major  approaches  to  studying  learning.  Cognitive  scientists  try  to 
develop  theories  and  models  of  learning  observable  in  humans  and  other  animals. 
Researchers  in  Artificial  Intelligence  develop  theories  and  models  of  any  type  of  learning. 
These  theories  and  models  do  not  necessarily  involve  living  organisms.  The  focus  of  this 
research  is  on  theories  and  models  of  learning. 

3.3.1    Views of Learning

Cohen  and  Feigenbaum  (1982)  list  four  views  of  learning:  (1)  any  process  by 
which  a  system  improves  its  performance;  (2)  the  acquisition  of  explicit  knowledge;  (3) 
skill  acquisition;  and  (4)  theory  formation  and  discovery. 

The first, improving performance on a given problem, is the most studied form of
learning. Valiant (1984) describes learning as a 'phenomenon' of knowledge acquisition
in the absence of explicit programming. This view of learning grew out of research in
problem solving.

The acquisition of explicit knowledge is a more limited view of task performance.
Skill acquisition refers to the phenomenon whereby one becomes more proficient at a task
with practice. Finally, theory formation and discovery views learning from the process
of scientific discovery of principles and theories.

Our focus is on the first two views of learning. Hence, we will view Machine
Learning as developing information processing systems which expand their knowledge
base and exhibit improved performance.

3.3.2    A  Model  of  a  Learning  Machine 

Figure  3.1  shows  the  components  of  a  learning  machine  (Cohen  and  Feigenbaum 
1982).  The  task  of  the  performance  element  is  the  focus  of  the  learning  system.  The 
learning  element  tries  to  improve  the  performance  element.  It  also  bridges  any 
information  gaps  between  the  environment  and  the  performance  element. 

The  environment  consists  of  all  information  required  by  the  machine.  This  is  the 
application's  domain.  Many  applications  use  the  closed  world  assumption.  This  states 
that  anything  not  derivable  from  the  given  facts  is  either  irrelevant  or  false. 

The  learning  element  will  sample  the  environment  to  acquire  knowledge  that  will 
improve  the  machine's  performance.  We  call  this  sample  a  training  set.  The  training  set 
provides  a  sample  of  instances  from  the  domain  of  interest.  The  space  of  all  possible 
instances  defines  the  instance  space. 
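
The relationship between the instance space and a training set can be sketched in a few lines. The attribute names and values below are hypothetical stand-ins for survey attributes, and uniform random sampling is an assumption made only for illustration.

```python
import itertools
import random

# Hypothetical attributes and their possible values.
attributes = {
    "age": ["18-44", "45-65", ">65"],
    "income": ["<$25K", "$25K-$45K", "$45K-$75K", ">$75K"],
    "sex": ["male", "female"],
}

def instance_space(attributes):
    """Every combination of attribute values defines one possible instance."""
    names = list(attributes)
    value_lists = [attributes[name] for name in names]
    return [dict(zip(names, combo)) for combo in itertools.product(*value_lists)]

def training_set(space, size, seed=0):
    """Sample instances from the space without replacement (seeded for repeatability)."""
    rng = random.Random(seed)
    return rng.sample(space, size)

space = instance_space(attributes)
sample = training_set(space, 5)

print(len(space))   # 3 * 4 * 2 combinations
print(len(sample))
```

The training set is only a small part of the instance space, which is why the learning element must generalize beyond the instances it has seen.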


Figure 3.1    Learning machine model


The  knowledge  base  contains  the  facts  and  rules  derived  by  the  learning  element. 
The  management  of  a  large  knowledge  base  is  also  a  problem  of  AI  research.  The 
problems  are  similar  to  designing  a  database  management  system.  The  knowledge  must 
be  easily  searched,  retrieved,  modified  and  stored.  The  'organized  information',  or 
knowledge,  can  be  specific,  general,  procedural,  declarative,  exact  or  fuzzy.  Procedural 
knowledge,  for  example,  exists  as  a  set  of  instructions  used  to  solve  a  problem.  Typical 
knowledge representation methods include frames, constraints, production rules, and
mathematical logic. Production rules, for example, are rule-based schemes which use
procedural knowledge. The procedure for solving a problem in a given domain exists as
a set of 'if...then...' rules.
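
A set of 'if...then...' rules can be applied mechanically by repeatedly firing any rule whose conditions are already known. The sketch below is a minimal forward-chaining loop; the rule contents and fact names are hypothetical and are not drawn from any knowledge base used in this work.

```python
def forward_chain(facts, rules):
    """Fire 'if conditions then conclusion' rules until no new fact is added."""
    known = set(facts)
    changed = True
    while changed:
        changed = False
        for conditions, conclusion in rules:
            if conclusion not in known and set(conditions) <= known:
                known.add(conclusion)
                changed = True
    return known

# Hypothetical production rules: (conditions, conclusion).
rules = [
    (("income_high", "age_18_44"), "confident"),
    (("confident",), "likely_buyer"),
]

derived = forward_chain({"income_high", "age_18_44"}, rules)
print(sorted(derived))
```

The second rule fires only after the first has added its conclusion, which shows how a chain of simple rules can encode a multi-step procedure.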

The  learning  element  consists  of  the  various  procedures  and  functions  employed 
to  expand  the  knowledge  base  or  improve  the  machine's  ability  to  perform  one  or  more 
tasks.  It  functions  as  an  'automated'  knowledge  acquisition  tool  within  the  learning 
machine model to delimit the rule space. One of its important functions is hypothesis
formation.  Through  trial  and  error,  the  learning  element  revises  the  current  hypothesis 
in  response  to  the  data  contained  in  the  sample  from  the  instance  space.  It  generates 
these  hypotheses  using  a  learning  strategy. 

Hypotheses  are  evaluated  by  the  feedback  provided  to  the  learning  element.  This 
feedback  normally  consists  of  results  from  comparing  the  hypothesis  to  some  'oracle'. 
Finally,  effective  rule  assessment  by  the  learning  element  requires  transparency  of  the 
performance  element.  Transparency  in  this  case  refers  to  the  learning  element's  ability 
to  trace  the  actions  of  the  performance  element. 
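
The cycle of forming a hypothesis, comparing it with an oracle, and revising it can be sketched with a very simple concept. Here the hypothesis is a single numeric threshold and the oracle is a hypothetical function standing in for the teacher; neither is taken from Cohen and Feigenbaum's model, which is far more general.

```python
def oracle(x):
    """Hypothetical target concept, hidden from the learner: x is at least 45."""
    return x >= 45

def learn_threshold(samples, oracle, start=0):
    """Revise a threshold hypothesis whenever it disagrees with the oracle."""
    threshold = start
    for x in samples:
        predicted = x >= threshold
        actual = oracle(x)
        if predicted and not actual:
            threshold = x + 1   # hypothesis too general: exclude x
        elif actual and not predicted:
            threshold = x       # hypothesis too specific: include x
    return threshold

samples = [10, 70, 30, 50, 44, 45]
hypothesis = learn_threshold(samples, oracle)
print(hypothesis)
```

Each disagreement with the oracle is the feedback that drives a revision, so the final hypothesis depends on which instances the sample happens to contain.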

3.3.3    Learning  Strategies 

There are several basic learning paradigms or learning strategies. Of particular
interest are (Cohen and Feigenbaum 1982):

Rote Learning is a simple type of learning similar to memorization. One retrieves
stored  knowledge  from  the  system  when  necessary.  However,  storage 
space  can  become  a  problem  for  a  relatively  large  knowledge  base. 

Learning by Instruction or Being Told is a type of learning requiring a
transformation of information into knowledge structures and operations
suitable for use by the machine. The problem areas involve interpreting
and assimilating both system requests and advice into a machine-usable
form. One must also integrate the knowledge into a knowledge base.

Learning  from  Examples  is  a  form  of  learning  that  performs  inductive 
information  processing  on  a  set  of  examples  (a  training  set). 

Learning  by  Analogy  is  a  form  of  learning  consisting  of  two  steps.  The  first 
step  involves  building  a  knowledge  base.  The  second  step  is  an  analogical 
mapping  where  one  performs  deductive  inference  on  new  problems  based 
on  their  similarities  to  the  existing  knowledge  base. 
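
Of the strategies listed above, rote learning is the easiest to sketch: the system stores each answer the first time it is worked out and retrieves it afterwards. The class and its names below are hypothetical, and the squaring function merely stands in for an expensive problem solver.

```python
class RoteLearner:
    """Stores each solved problem and retrieves the stored answer on later requests."""

    def __init__(self, solve):
        self._solve = solve      # underlying problem solver
        self._memory = {}        # stored knowledge: problem -> answer
        self.calls = 0           # how many times the solver was actually run

    def answer(self, problem):
        if problem not in self._memory:
            self.calls += 1
            self._memory[problem] = self._solve(problem)
        return self._memory[problem]

learner = RoteLearner(lambda n: n * n)
first = learner.answer(4)
second = learner.answer(4)   # retrieved from memory, solver not run again
print(first, second, learner.calls)
```

The memory grows with every distinct problem, which is the storage concern noted for large knowledge bases.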

Learning  can  take  place  in  two  broad  settings:  'supervised'  or  'unsupervised'. 
In  supervised  learning,  a  teacher  (or  an  all-knowing  oracle)  is  present.  The  presence  of 
a  teacher  removes  ambiguities  from  the  training  set.  The  machine  can  then  learn  much 
more  rapidly  and  efficiently.  In  unsupervised  learning,  the  learning  system  has  no 
instructor  but  must  acquire  knowledge  on  its  own.  The  training  set  may  be  full  of 
ambiguities  which  the  learning  algorithm  must  resolve  on  its  own. 

The four strategies for learning reflect a decreasing reliance on supervision and an
increasing complexity of the inference process. For example, in rote learning, the teacher
directly supplies information. No inference is needed. Learning by analogy involves
little supervision but requires a complex inference capability.

Within  each  general  strategy,  we  employ  different  inference  mechanisms  to 
varying degrees. The main inference mechanisms are deduction and induction.
Deduction  moves  from  general  truths  to  specific  cases  whereas  induction  moves  from 
specific  cases  to  generalizations. 

Deductive information processing is 'truth preserving'. All 'truths' classified by
the deduced information are implied by the initial information. Hence, new information
'preserves' the facts contained in old information. Deriving specific facts from general
rules or developing new rules from old ones are deductive procedures.

Inductive  information  processing  is  'falsity  preserving'.  The  induced  information 
correctly  categorizes  all  fallacies  contained  in  the  initial  knowledge.  Using  raw  data  or 
examples to establish laws, rules or general patterns is an example of an inductive
procedure.
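
The contrast between the two inference mechanisms can be sketched as follows. The rule, the labels, and the interval generalization are hypothetical choices made for illustration; induction in particular can be carried out in many other ways.

```python
def deduce(rule, case):
    """Deduction: apply a general rule (condition, conclusion) to a specific case."""
    condition, conclusion = rule
    return conclusion if condition(case) else None

def induce_interval(positive_examples):
    """Induction: generalize observed positive cases to the smallest covering interval."""
    return (min(positive_examples), max(positive_examples))

# Deduction: from a general rule to a specific case.
rule = (lambda age: age > 65, "over-65 age group")
print(deduce(rule, 70))

# Induction: from specific cases to a general description.
ages_reporting_better = [19, 25, 44, 31]
low, high = induce_interval(ages_reporting_better)
print(low, high)
```

The deduced conclusion is guaranteed by the rule, whereas the induced interval is only consistent with the examples seen so far and may be contradicted by a later case.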

Learning  by  Instruction  and  Learning  from  Examples  are  (arguably)  the  two  most 
appropriate  strategies  for  knowledge  acquisition  aimed  at  expediting  the  construction  of 
knowledge-based  applications.  Learning  by  instruction  takes  two  common  forms.  In  one, 
there  is  a  computer-based  system  which  aids  an  expert  or  knowledge-engineer  in  building 
and  testing  a  knowledge-base.  TEIRESIAS  (Davis  1982)  was  the  first  illustration  of  this 
approach.  Others  are  discussed  in  "Knowledge  Acquisition  for  Knowledge-Based 
Systems:  Notes  on  the  State-of-the-Art"  (Boose  and  Gaines  1989).  The  focus  of  this  work 
is  on  machine  learning,  independent  of  interaction  with  an  expert. 

The  second  type  of  learning  by  instruction  is  called  Explanation-based 
Generalization  (Mitchell,  Keller,  and  Kedar-Cabelli  1986;  Dejong  and  Mooney  1986; 
O'Rorke  1989;  Flann  and  Dietterich  1989).  Here,  a  source  provides  initial  knowledge  that 
may  not  be  directly  usable.  One  then  uses  deduction  to  obtain  more  directly  applicable 
information. The deduction provides a 'proof' of the desired goal using the starting
knowledge. This proof becomes an explanation that can be generalized to more directly
usable 'compiled' knowledge. Hence, Explanation-based Generalization is a useful tool
for machine learning in an area where a formal theory or wealth of deeper knowledge
may exist.

Learning  from  examples  is  an  induction  process.  The  two  extremes  of  learning 
from  examples  bracket  the  range  between  supervised  and  unsupervised  learning.  Under 
supervised  learning,  the  teacher  may  classify  the  training  set  into  disjoint  sets  of  examples 
of  a  concept.  Yet  another  form  of  supervised  learning  lets  the  learning  element  query  the 
teacher  to  determine  what  particular  examples  illustrate.  In  unsupervised  learning,  the 
learning  element  examines  the  training  set  to  discern  features  that  may  impact  the 
performance  element. 

The  remainder  of  this  work  focuses  on  learning  from  examples.  Furthermore,  we 
will  restrict  our  discussion  to  supervised  learning.  In  unsupervised  learning,  the  inference 
task  is  more  difficult,  although  there  are  many  available  methods,  which  include  neural 
nets (Kohonen 1988), cluster analysis (Cooley and Lohnes 1971), and others (Michalski
and  Stepp  1983).  Work  in  supervised  learning  from  examples  has  focused  on  learning 
theories  and  learning  algorithms. 

3.3.4    Learning  Theories 

Over  the  past  decade  there  has  been  an  explosive  growth  in  the  theory  of  learning. 
Much of this work can be attributed to two seminal ideas: Mitchell's Version Space
(Mitchell 1982) and Valiant's PAC (Probably Approximately Correct) learning (Valiant
1984). Haussler (1988) has been a prolific source of many important results.


Learning theory considers three aspects of concept formation. They are concept
accuracy, storage efficiency, and computational efficiency. While a discourse on the
state of machine-learning theory is beyond the scope of this dissertation, an acceptable
learning algorithm must operate within reasonable storage and time limitations to produce
an acceptably accurate concept. 'Reasonable' usually means some polynomially bounded
measure.

Concept accuracy shows 'how well' the system learns. This is the percentage of
instances correctly classified by the learned concept. This measure is well suited for
description or classification tasks. However, it may not be appropriate for a
pattern-matching task.
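As a rough illustration of this measure, the following Python sketch computes the fraction of labeled instances that a concept classifies correctly. The names `concept_accuracy` and `is_polygon`, the toy sample, and the deliberately noisy last label are assumptions made for the example, not part of any system described here.

```python
def concept_accuracy(concept, labeled_instances):
    """Return the fraction of (instance, label) pairs the concept classifies correctly."""
    correct = sum(1 for instance, label in labeled_instances if concept(instance) == label)
    return correct / len(labeled_instances)


def is_polygon(number_of_sides):
    # Hypothetical learned concept: 'number of sides > 2'.
    return number_of_sides > 2


# Hypothetical sample of (number_of_sides, teacher's label); the last label is noisy.
sample = [(4, True), (3, True), (0, False), (1, False), (2, True)]
accuracy = concept_accuracy(is_polygon, sample)  # four of the five labels are matched
```

Multiplying the returned fraction by 100 gives the percentage form used in the text.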

Storage efficiency indicates 'how costly' it is for the system to learn. Memory
is a resource that system developers must manage well. Thus, superior memory
management for a given task demonstrates improved performance.

Computational efficiency reveals 'how long' the system takes to learn. A
desirable property for any application is that the computational process by which the
machine learns takes a small number of steps. This is normally a relative measure. Two
algorithms exhibit similar performance if both learn using a number of computational
steps on the same order of magnitude. On the other hand, a single algorithm performs
poorly if its computation time is some exponential function of a combination of its inputs.


3.3.5    Learning  Algorithms 

A  learning  algorithm  inputs  a  sample  and  outputs  a  concept  or  a  'FAIL'  message. 
Of  course,  the  hope  is  that  the  output  concept  is  'close'  to  the  true  (target)  concept. 


[Figure: diagram headed 'Considerations related to learning algorithms', listing
representation of learned concepts; incremental or non-incremental mode; dealing with
uncertainty; single concept or multiple concepts; the algorithm's search strategy; goal
for concept formation; the application's domain; and usefulness criterion.]

Figure 3.2    Learning algorithm considerations


There are many variations that a given algorithm may consider. Figure 3.2 shows
eight such considerations. In this section we list and discuss several of these
considerations.

3.3.5.1  Representation of learned concepts

In AI, learning is viewed as concept formation. In its simplest view, a concept
is an equivalence class of instances. The description of the class in some language
provides the class description. Each class contains the subset of instances which are
members of the class. In supervised learning, the teacher or oracle identifies the correct
class for each member. For example, suppose we had a collection of geometric figures
(e.g., squares, rectangles, right triangles, isosceles triangles, lines, circles, ellipses, dots,
spheres, cubes, pyramids, ...). This instance space (the collection of figures) can be said
to contain the object-classes of rectangular objects, triangular objects, circular objects, and
linear objects. A more general object-class for this same instance space is polygon.
Squares and pyramids are 'positive' instances of a polygon whereas circles and lines
represent 'negative' instances of a polygon. A possible concept for this polygon class
might be 'number of sides > 2'. For this case the representation language must detail the
'side' object for a polygon and the operation of 'counting' the sides.

As another example, consider the concept of 'a person who is a poor credit risk'.
This can be the set of all people who will default on a loan. We might describe this class
using some language such as:

-People who are unemployed
-People who are heavily indebted relative to their income source
-Younger people who have a police record

3.3.5.2    Incremental  and  Non-Incremental  Learning 

Another consideration is whether the algorithm will operate in an 'incremental'
or 'non-incremental' learning mode. Incremental learning algorithms revise the current
hypothesis after sequentially examining each instance in the training set. Re-examining
any previous instance is not possible. The structure of the training set and the order in
which the instances arrive are factors which impact the efficiency of the algorithms.
Non-incremental learning algorithms examine the entire training set as a whole for creating
concepts. They re-examine the training instances as many times as needed to revise the
hypothesis.
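The contrast between the two modes can be sketched in Python. The example below is an illustration only: the hypothesis (the tightest numeric interval covering the positive instances), the function names, and the data are all assumptions chosen for brevity. The incremental learner revises its hypothesis after each instance and never stores past instances; the non-incremental learner receives the whole training set and may scan it as often as it likes.

```python
def incremental_interval(stream):
    """Revise the (low, high) hypothesis after each instance; nothing is stored."""
    low = high = None
    for value, is_positive in stream:
        if not is_positive:
            continue                      # this simple hypothesis ignores negatives
        low = value if low is None else min(low, value)
        high = value if high is None else max(high, value)
    return low, high


def batch_interval(training_set):
    """Examine the whole training set at once (here, two passes over the positives)."""
    positives = [value for value, is_positive in training_set if is_positive]
    if not positives:
        return None, None
    return min(positives), max(positives)


# Hypothetical training set of (attribute value, teacher's label).
data = [(3, True), (9, False), (5, True), (1, False), (4, True)]
incremental_result = incremental_interval(iter(data))   # consumed one instance at a time
batch_result = batch_interval(data)
```

For this simple hypothesis both modes reach the same interval; in general the order of arrival can affect what an incremental learner produces and how efficiently it does so.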

3.3.5.3  Dealing  with  uncertainty 

An  important  issue  related  to  an  algorithm  is  its  ability  to  handle  uncertainty. 
Uncertainty  has  two  sources—residual  variation  and  noise.  Unrecorded  extraneous  factors 
which  affect  the  results  represent  residual  variation.  Conflicts  in  the  training  set, 
misclassification  errors  and  measurement  errors  are  all  examples  of  noise.  The  presence 
of  noise  increases  the  computational  complexity  of  many  learning  algorithms. 

3.3.5.4  Learning  single  or  multiple  concepts 

Recall the earlier example involving geometric figures. Finding a concept to
determine the 'polygon' class represents single-concept learning. The single concept
(number of sides > 2) classifies all of the figures into positive or negative examples of
a polygon. Finding concepts to identify the classes of rectangular objects, triangular
objects, circular objects, linear objects, etc., is the focus of multiple-concept learning
algorithms.

3.3.5.5  Algorithm's  search  strategy 

There are many approaches to implementing an algorithm. However, inductive
learning usually involves a search strategy. Using a good search method improves the
chances of getting a solution quickly. The most important issue associated with search
is the space to be searched, the 'state space'. This is the set of all possible concepts.
A 'state' represents a discrete 'examination point'. The state space contains the set of all
possible states along with implied states using various defined operators.

Operators  represent  ways  to  'move'  to  successor  states,  or  general  rules  of 
inference  to  create  new  assertions  from  existing  ones.  A  third  issue  related  to  search  is 
the  'control  strategy'  used  for  guiding  the  search.  This  is  simply  the  search  strategy. 
Several  general  methods  are  available  for  searching.  One  can  also  classify  learning 
algorithms  according  to  the  searching  technique  they  use. 

Data-driven  methods  use  the  training  instances  (data)  to  drive  the  search.  These 
algorithms  simply  specify  the  search  order  for  examining  the  search  space.  Depth-first 
and  Breadth-first  are  two  common  data-driven  search  techniques.  These  are  'blind'  search 
techniques  because  they  use  exhaustive  approaches  and  will  not  see  the  goal  until  they 
get  there. 
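The two blind strategies named above differ only in the order in which pending states are examined. The following Python sketch makes that concrete under stated assumptions: the state space is a small hypothetical graph, `successors` plays the role of the operators, and the function name `blind_search` is invented for the example.

```python
from collections import deque


def blind_search(start, successors, is_goal, breadth_first=True):
    """Return the list of states examined, in order, up to and including the goal."""
    frontier = deque([start])
    seen = {start}
    examined = []
    while frontier:
        # Breadth-first takes the oldest pending state, depth-first the newest.
        state = frontier.popleft() if breadth_first else frontier.pop()
        examined.append(state)
        if is_goal(state):
            return examined
        for next_state in successors(state):
            if next_state not in seen:
                seen.add(next_state)
                frontier.append(next_state)
    return None                            # the goal is not reachable


# Hypothetical state space: each state maps to the states its operators produce.
space = {"A": ["B", "C"], "B": ["D", "E"], "C": ["F"], "D": [], "E": [], "F": []}
bfs_order = blind_search("A", space.__getitem__, lambda s: s == "F", breadth_first=True)
dfs_order = blind_search("A", space.__getitem__, lambda s: s == "F", breadth_first=False)
```

Neither call uses any knowledge of where the goal lies, which is what makes both searches 'blind'.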

Model-driven methods use an a priori model to guide their search. The designer
incorporates background or domain knowledge in the model to increase the efficiency of
the search. One uses knowledge of the goal state to eliminate certain areas of the search
space from examination. The knowledge may be heuristic or definitive.

3.3.5.6  Concept formation goals

The  goal  behind  the  specific  desire  to  learn  a  concept  may  be  a  consideration. 
Four  common  goals  are  classification,  description,  pattern  matching  and  prediction. 
Researchers  commonly  use  one  of  these  four  tasks  to  develop  new  techniques  and 
examine  variations  of  existing  techniques.  Also,  intellectual  processes  such  as  learning 
and  reasoning  inherently  involve  these  same  goals. 


Classification is a standard goal. Given a new or unknown object, the system
must classify the object as a positive or negative instance of a learned concept.

Description is a core task for many learning algorithms. Given a set of positive
and negative instances of some concept, the system must learn the concept that describes
all of the positive examples. Recall that a learned concept, 'number of sides > 2',
described a set of geometric figures consisting of rectangles, squares, cubes, pyramids,
etc., and not circles, spheres, cones, etc.

Pattern  Matching  is  a  task  commonly  found  in  the  'adaptive'  systems  developed 
for  many  engineering  applications.  These  systems  are  adaptive  because  of  the  dynamic 
or  constantly  changing  domain-characteristics  associated  with  them.  Given  an  input 
object,  the  system  must  first  establish  a  pattern  for  the  object.  Next,  the  system 
determines  whether  the  object's  pattern  matches  any  of  the  existing  patterns  currently 
stored  in  the  knowledge  base.  For  example,  suppose  we  want  a  system  that  translates  a 
handwritten sentence from English to Japanese. One of the first tasks to perform is
identifying each letter and/or word of the sentence. Recall that most people have a
unique handwriting style. Hence, the appearance of any given handwritten letter or word
varies from person to person. However, everyone forms their letters according to some
standard template, the alphabet. The input to the system is the handwritten letter. The
output is the correctly identified letter of the alphabet that the handwritten letter
represents. The task is to map the input to the proper output.

Prediction is another traditional goal. Using a learned concept and a sequence
of instances, the system must predict other likely examples of the sequence.



3.3.5.7    Application  domain 

Many  AI  applications  show  potential  payoffs  for  developing  and  using  intelligent 
computer  programs.  AI  practitioners  do  not  develop  programs  that  act  'just  like  a 
human'.  Instead,  they  develop  programs  which  incorporate  their  understanding  of  the 
intellectual  process  to  solve  problems.  Performing  this  process  is  laborious,  even  by  the 
experts,  in  domains  where  the  knowledge  is  incomplete  and  not  well  defined.  Choosing 
the  conceptual  primitives  for  these  types  of  domains  is  a  difficult  process.  On  the  other 
hand,  there  are  other  domains  for  which  large  amounts  of  knowledge  exist.  Learning, 
thinking  and  reasoning  in  these  domains  are  straightforward  processes  and  the  conceptual 
primitives  are  easily  specified. 

AI  practitioners  also  focus  on  constructing  knowledge  bases  that  are  domain 
independent.  This  means  that  a  low  level  of  coupling  exists  between  the  performance 
element  and  the  environment.  This  makes  it  possible  to  'interchange'  knowledge  bases 
without  making  major  changes  to  the  system  design.  One  can  then  study  the  performance 
of  a  given  learning  paradigm  using  several  representation  schemes.  A  key  to  achieving 
domain  independence  is  separating  the  domain-specific  knowledge  from  the  knowledge 
representation.  The  representation  determines  the  inferences,  relations,  and  computational 
objects  available  to  the  machine.  The  language  also  affects  the  process  of  acquiring  and 
organizing  the  knowledge  in  the  knowledge  base. 

To help separate the representation scheme from the application medium, Davis
(1982) stratifies knowledge into three distinct levels. The first level contains object level
knowledge. This level focuses on knowledge about objects in the application's domain.
The next level concerns the conceptual building blocks of the knowledge representations.
It details the various tools and techniques for acquiring and manipulating the knowledge.
The third level describes the conceptual primitives behind representations in general. This
represents 'meta-knowledge' or knowledge about the objects and structures of the
representation language itself.

3.3.5.8    Criterion 

Another consideration in choosing or developing a learning algorithm is to
specify the measure of usefulness to be employed. In other words, at what point will
the algorithm's output be useful enough to adequately solve a problem? For example, in
PAC learning, one tries to determine a concept that has probability no greater than δ of
an error greater than ε. In neural net learning, a common criterion is to minimize the sum
of squared errors between the target outputs and the network's learned outputs.
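For concreteness, the two criteria just mentioned can be written out as follows. This is a sketch in notation assumed here rather than taken from the cited sources: h is the learned concept, error(h) is its error rate with respect to the target concept, and t_{pk} and o_{pk} are the target and learned outputs of output unit k on training pattern p.

```latex
% PAC criterion: the chance of learning a concept whose error exceeds
% \epsilon is at most \delta.
\Pr\bigl[\operatorname{error}(h) > \epsilon\bigr] \le \delta

% Sum of squared errors over all training patterns p and output units k.
E = \sum_{p} \sum_{k} \left( t_{pk} - o_{pk} \right)^{2}
```

In both cases the criterion fixes a threshold or objective in advance, so that the learner has a definite test for when its output is good enough.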

Of these eight considerations, representation of learned concepts, search strategies,
and usefulness criterion are of special interest for this work. The next sections describe
concept description languages and briefly review a standard algorithm used in our
research.

3.4   Concept  Description  Languages 

The  language  chosen  to  represent  the  concept  is  critical  to  the  learning  process. 
The  language  may  not  be  expressive  enough  to  exactly  capture  the  concept.  Or,  it  may 
be  too  complicated  for  a  human  to  understand  or  use. 


One taxonomy of learning may focus on the concept description language. Some
common examples of concept descriptions include:

1. Linear equations,
2. Non-linear equations,
   -Polynomials
   -Neural nets
3. Decision Trees,
4. Conjunctive equations,
5. Disjunctive equations,
6. K-DNF equations,
7. K-CNF equations, and
8. Decision Lists (Rivest 1987).

Concept formation is integrally tied to the desired representation of the knowledge-base.

Restrictions placed on the language for representing learned concepts achieve a
balance between the expressiveness and efficiency of the system. For example, most
people prefer simpler explanations to complex ones, whereas highly efficient routines tend
to produce complex expressions. This 'bias' limits the ability of the program to learn
only the required concepts, but it also increases the efficiency of the search. Considering
inductive learning, Haussler (1988) states:

The  most  prevalent  form  of  inductive  bias  is  the  restriction  of  the 
hypothesis  space  to  only  concepts  that  can  be  expressed  in  some  limited 
concept  description  language,  e.g.  concepts  described  by  logical 
expressions  involving  only  conjunctions.  A  still  stronger  bias  can  be 
obtained  by  also  introducing  an  a  priori  preference  ordering  on  hypotheses, 
e.g.  by  preferring  hypotheses  that  have  shorter  descriptions  in  the  given 
description  language.    (178) 



There  are  a  number  of  different  research  issues  in  knowledge  representation.  They 
range  from  the  semantics  of  the  language  itself,  to  the  kinds  of  hierarchies  and 
inheritances  supported  by  the  scheme.  Additionally,  there  are  issues  concerning  the  kinds 
of  knowledge  and  how  to  distinguish  them.  Following  is  a  brief  discussion  of  several  of 
these  key  issues. 

Barr  and  Feigenbaum  (1981)  list  four  kinds  of  knowledge  represented  in  AI 
systems.  They  are  object  knowledge,  event  knowledge,  performance  knowledge  and 
meta-knowledge.  Object  knowledge  represents  facts  about  objects  in  the  world  around 
us.  Event  knowledge  indicates  what  we  know  about  actions  and  events  in  the  world.  For 
example,  a  well  known  event  is  that  the  sun  rises  in  the  morning.  Performance 
knowledge  consists  of  knowledge  about  how  to  do  things.  Finally,  meta-knowledge 
pertains  to  knowledge  about  knowledge. 

Additionally,  the  authors  introduce  three  characteristics  useful  for  comparing 
different representation schemes: scope, understandability, and modularity. Scope refers
to  the  amount  of  information,  or  level  of  detail,  used  to  describe  objects  and  events. 
Understandability  relates  to  how  well  humans  comprehend  the  information  in  its  present 
form  (i.e.,  the  data  structure).  This  is  important  not  only  for  acquiring  knowledge  from 
the  experts  but  also  for  interacting  with  and  giving  explanations  to  the  users.  Modularity 
concerns  itself  with  the  degree  of  autonomy  for  adding,  deleting,  or  modifying  individual 
chunks  of  knowledge  in  the  system.  The  amount  of  interaction  between  the  various 
database  entries  depends  on  the  representation  scheme  and  data  structure  used. 


This research focuses on using Decision Trees and Decision Lists as concept
description languages. Specifically, we explore learning DNF concepts using binary
decision trees, and its extension to representing the concept as a decision list. We desire
to develop a framework for feature construction, or methods of enlarging the set of
primitive attributes with additional attributes, or features, that we construct using
combinations of the primitives. The approach uses ideas from the fields of machine
learning, pattern recognition, and category theory. The following two sections describe
the notation and/or terms used in the sequel.

3.4.1    Binary  Trees 

We  begin  this  section  with  a  brief  overview  of  terms  and  properties  associated 
with  binary  trees.  A  research  goal  is  to  find  equivalent  binary  trees,  (i.e.,  trees  giving  the 
same  decision),  that  generally  satisfy  some  size  optimality  rule.  A  complete  discussion 
of  the  following  concepts  is  found  in  Sedgewick  (1992)  and  Safavian  et  al.  (1991). 

A vertex is a simple object (or node) having a name or label. An edge connects
two vertices. A nonempty collection of vertices and edges satisfying various
requirements is a tree. A list of distinct vertices, where successive vertices are connected
by edges, describes a path in the tree. The defining property of a tree is that there exists
exactly one path between any two nodes in the tree. We refer to a tree by its one
designated root node. Hence, any node in the tree defines a subtree consisting of the
node (as a root node) and the nodes below it.


With the exception of the root node, each node in the tree has exactly one parent,
or node above it, and zero or more children, or nodes directly below it. If the order of
the children is specified for the nodes, then we have an ordered tree. Leaves, or terminal
nodes, are nodes without any children. Nonterminal nodes have at least one child.
Additionally, terminal nodes and nonterminal nodes are sometimes called external nodes
and internal nodes. A tree has an associated external path length and internal path
length. The external (internal) path length is the sum of the lengths of the paths from
each external (internal) node to the root.

The  number  of  nodes  on  the  path  from  any  node  in  the  tree  to  the  root,  (excluding 
the  node),  defines  the  level  of  the  node.  The  maximum  level  among  all  nodes  in  the  tree 
defines  the  height  of  the  tree. 

A  binary  tree  is  an  ordered  tree  having  both  internal  and  external  nodes  such  that 
every  internal  node  has  exactly  two  children.  Furthermore,  the  terms  left  child  and  right 
child  refer  to  the  ordered  children  of  an  internal  node.  An  empty  binary  tree  has  one 
external  node  and  no  internal  nodes.  A  binary  tree  where  internal  nodes  completely  fill 
every level, except for possibly the last, is called a full binary tree. A complete binary
tree  is  a  full  binary  tree  when  only  external  nodes  appear  at  the  two  greatest  levels. 
Furthermore,  all  nodes  on  the  maximum  level  appear  to  the  left.  Note  that  the  external 
nodes  of  a  binary  tree  only  serve  as  placeholders.  The  major  focus  of  constructing  binary 
trees  is  to  'structure'  the  internal  nodes  according  to  some  scheme. 

Finally,  we  have  the  following  well-defined  properties  associated  with  trees. 


PROPERTIES [Tree Properties, Sedgewick (1992)]

4.1 There is exactly one path connecting any two nodes in a tree.

4.2 A tree with N nodes has N - 1 edges.

4.3 A binary tree with N internal nodes has N + 1 external nodes.

4.4 The external path length of any binary tree with N internal nodes is
2N greater than the internal path length. (38-39)

4.5 The height of a full binary tree with N internal nodes is about log_2 N.
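The counting properties above are easy to check on a small example. The Python sketch below is illustrative only: the `Node` class, the `tally` helper, and the particular tree are assumptions made for the example. It counts internal and external nodes and accumulates both path lengths in a single traversal, measuring each path length as the node's level.

```python
class Node:
    """A binary-tree node; an external node has no children."""

    def __init__(self, left=None, right=None):
        self.left = left
        self.right = right

    def is_external(self):
        return self.left is None and self.right is None


def tally(node, level=0):
    """Return (internal nodes, external nodes, internal path length, external path length)."""
    if node.is_external():
        return 0, 1, 0, level
    l_int, l_ext, l_ipl, l_epl = tally(node.left, level + 1)
    r_int, r_ext, r_ipl, r_epl = tally(node.right, level + 1)
    return (1 + l_int + r_int, l_ext + r_ext,
            level + l_ipl + r_ipl, l_epl + r_epl)


# Hypothetical binary tree with four internal nodes.
tree = Node(Node(Node(), Node()),
            Node(Node(), Node(Node(), Node())))
internal, external, internal_path_length, external_path_length = tally(tree)
```

For this tree the counts come out as N = 4 internal nodes and N + 1 = 5 external nodes, and the external path length exceeds the internal path length by 2N = 8.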

3.4.2   Decision Trees

This research focuses on representing Boolean functions using binary decision
trees. A Boolean function is a function of Boolean variables (i.e., elements of the set
{0,1}), or literals (a Boolean variable or its negation). A binary decision tree consists of
both internal and external nodes. Each internal node represents a comparison of two
objects and has an edge for each outcome. Each external node, or leaf, represents a result.
Pagallo (1990) shows that every Boolean function has a Decision Tree or Disjunctive
Normal Form (DNF) representation. Other results of using Decision Trees, or DNF
equations, are found in Rivest (1987) and Haussler (1988).

Next, we develop a practical approach we need for examining both decision trees
and decision lists. We use the formal definitions of decision trees and the functions they
represent as found in Ehrenfeucht and Haussler (1989). Hence, let us adopt the following
definitions.

Let $V_n = \{v_1, \ldots, v_n\}$ be a set of $n$ Boolean variables. Let $X_n = \{0,1\}^n$.
The class $T_n$ of decision trees (over $V_n$) is defined recursively as follows:

(i) If $Q$ is the binary tree consisting of only a root node labeled either 0 or 1,
then $Q \in T_n$. (Henceforth we will abbreviate this case by simply saying
"$Q = 0$" or "$Q = 1$.")

(ii) If $Q_0, Q_1 \in T_n$ and $v \in V_n$, then the binary tree with root labeled $v$, left
subtree $Q_0$, and right subtree $Q_1$ is in $T_n$. (Henceforth we will refer to the
left subtree as the 0-subtree and the right subtree as the 1-subtree.)

A  reduced  decision  tree  has  each  variable  appearing  at  most  once  in  any  path  from 
the  root  to  a  leaf.  Positive  and  negative  leaves  correspond  to  1  and  0  leaves  respectively. 
Positive  paths  lead  to  positive  leaves  while  negative  paths  lead  to  negative  leaves. 

The rank of a decision tree $Q$, $r(Q)$, is defined as follows:

(i) If $Q = 0$ or $Q = 1$ then $r(Q) = 0$.

(ii) Else if $r_0$ is the rank of the 0-subtree of $Q$ and $r_1$ the rank of
the 1-subtree, then

$$
r(Q) =
\begin{cases}
\max(r_0, r_1) & \text{if } r_0 \neq r_1 \\
r_0 + 1 \; (= r_1 + 1) & \text{otherwise}
\end{cases}
$$
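The rank definition translates directly into a short recursive function. The Python sketch below assumes a representation chosen only for illustration: a leaf is the integer 0 or 1, and an internal node is a tuple `(i, q0, q1)` holding the index of the root variable, the 0-subtree, and the 1-subtree.

```python
def rank(tree):
    """Rank of a decision tree given as a leaf (0 or 1) or a tuple (i, q0, q1)."""
    if isinstance(tree, int):              # case (i): a single leaf
        return 0
    _, q0, q1 = tree
    r0, r1 = rank(q0), rank(q1)
    # case (ii): the larger rank if the two differ, otherwise one more than either
    return max(r0, r1) if r0 != r1 else r0 + 1


# Hypothetical trees: a 'chain' in which every node has a leaf child,
# and a balanced tree of depth two.
chain = (1, 0, (2, 0, (3, 0, 1)))
balanced = (1, (2, 0, 1), (2, 1, 0))
chain_rank = rank(chain)
balanced_rank = rank(balanced)
```

The chain has rank 1 however long it grows, while the balanced tree of depth two has rank 2, so rank reflects how 'bushy' a tree is rather than how deep it is.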

We describe a decision tree $Q \in T_n$ representing a Boolean function $f_Q$ as:

(i) If $Q = 0$ then $f_Q$ is the constant function 0 and if $Q = 1$ then $f_Q$ is the constant
function 1.

(ii) Else if $v_i$ is the label of the root of $Q$, $Q_0$ the 0-subtree of $Q$ and $Q_1$ the
1-subtree, then for any point $x = (a_1, \ldots, a_n) \in \{0,1\}^n$, if $a_i = 0$ then
$f_Q(x) = f_{Q_0}(x)$, else $f_Q(x) = f_{Q_1}(x)$.
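Evaluating the function a tree represents amounts to following one path from the root to a leaf. The Python sketch below assumes an illustrative representation that is not part of the cited definitions: a leaf is the integer 0 or 1, an internal node is a tuple `(i, q0, q1)` with a 1-based variable index i, and a point x is a tuple of 0/1 values.

```python
def evaluate(tree, x):
    """Compute f_Q(x) for a tree Q and a point x = (a_1, ..., a_n)."""
    while not isinstance(tree, int):       # descend until a leaf (0 or 1) is reached
        i, q0, q1 = tree                   # root labeled v_i, 0-subtree, 1-subtree
        tree = q0 if x[i - 1] == 0 else q1
    return tree


# Hypothetical tree for the conjunction v_1 AND v_2.
conjunction_tree = (1, 0, (2, 0, 1))
truth_table = {x: evaluate(conjunction_tree, x)
               for x in [(0, 0), (0, 1), (1, 0), (1, 1)]}
```

Only the point (1, 1) reaches the positive leaf, which matches the truth table of the conjunction.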


3.4.3    Decision  Lists 


Another powerful representation for concepts is the decision list (Pagallo and
Haussler 1990; Rivest 1987). Formally, a decision list is defined as a list of pairs
$L = (f_1, a_1), \ldots, (f_s, a_s)$ such that: (1) $1 \le i \le s$; (2) $a_i \in \{0,1\}$; (3) $f_i$ is a
Boolean function defined by a conjunction of up to $k$ literals for some fixed $k$; and
(4) the last function $f_s$ is always the function true. The Boolean function represented
by $L$ is defined by letting $L(x) = a_j$, where $j$ is the least index such that $f_j(x) = 1$
(Rivest 1987; Ehrenfeucht and Haussler 1989). Additionally, this class of functions
includes both $k$-DNF and $k$-CNF functions.
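A decision list is evaluated by scanning its pairs in order and returning the value attached to the first conjunction that holds. The Python sketch below uses an encoding assumed only for illustration: each conjunction is a list of literals `(i, wanted)` meaning 'variable v_i equals wanted', and the empty conjunction stands for the function true.

```python
def holds(conjunction, x):
    """True if every literal (i, wanted) in the conjunction is satisfied by x."""
    return all(x[i - 1] == wanted for i, wanted in conjunction)


def decide(decision_list, x):
    """Return a_j for the least index j whose conjunction f_j holds at x."""
    for conjunction, value in decision_list:
        if holds(conjunction, x):
            return value
    raise ValueError("the last conjunction must be the constant function true")


# Hypothetical list with k = 2: (v_1 AND NOT v_2, 1), (v_3, 0), (true, 1).
example_list = [([(1, 1), (2, 0)], 1), ([(3, 1)], 0), ([], 1)]
first_match = decide(example_list, (1, 0, 1))    # the first pair fires
second_match = decide(example_list, (0, 0, 1))   # falls through to the second pair
default_value = decide(example_list, (0, 1, 0))  # only the final 'true' pair fires
```

Because the last conjunction is always true, every point receives a value and the error branch is never reached for a well-formed list.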

Rivest  (1987)  shows  how  decision  lists  allow  for  a  more  complex  decision  at  each 
node.  Also,  decision  lists  have  a  simpler  structure  relative  to  decision  trees,  and  certain 
types  are  learnable  in  the  sense  of  Valiant  (Rivest  1987;  Ehrenfeucht  and  Haussler  1989; 
Pagallo  and  Haussler  1990). 

Finally,  Ehrenfeucht  and  Haussler  (1989)  make  a  transformation  where  decision 
lists  are  a  class  T  of  rank  1  decision  trees  defined  recursively  as: 

(i)   A  single  leaf  labeled  either  0  or  1  is  in  T 

(ii) If $Q_0$ is a decision tree in T, $Q_1$ is a leaf labeled either 0 or 1, and v is a
variable, then the decision tree with root labeled v, left subtree $Q_0$, and
right subtree $Q_1$, AND the similar tree with left subtree $Q_1$ and right
subtree $Q_0$, are both in T.

Regarding their transformation, the authors suggest that by simply inventing a new
variable for every Boolean function represented by a conjunction of up to k literals, we
can, using this new set of variables, represent any decision list in normal form in which
each $f_i$ is a single literal.

3.5   C4.5  Machine  Learning  Programs 

C4.5 is a set of programs developed by J. Ross Quinlan (Quinlan 1993). This set
of programs is a descendant of ID3, a well documented decision-tree algorithm in the
machine-learning literature. Essentially, C4.5 examines a sample and inductively
constructs a decision tree by generalizing from specific examples. There are several key
requirements for C4.5. First of all, each 'case' or example is expressed in terms of a
fixed set of attributes which may have numeric, discrete, or even unknown values.
Secondly, C4.5 performs 'supervised learning'; hence, the classes or categories must be
predefined. Each case is assigned to one and only one class. A third requirement is that
the categories must be mutually exclusive and that there be far more cases than classes.
Having a sufficient amount of data is a necessary requirement for effectively learning a
concept. A complete description of the programs is found in Quinlan (1993).

C4.5 builds a classifier in the form of a decision tree using nominal, continuous,
and even missing attribute values. The program finds appropriate thresholds against
which to compare the values of continuous attributes. Unknown test outcomes are
distributed probabilistically according to the relative frequency of known outcomes. C4.5
also uses 'pruning' methods to produce simpler decision trees. Essentially, 'pruning'
methods simplify a tree by discarding one or more of the subtrees and replacing them
with leaves. The program also lists various parameters associated with the decision tree
it creates, including (1) the misclassification rate, (2) an estimate of the misclassification
rate for 'unseen' cases, and (3) the size or number of nodes.

C4.5 uses the 'gain-ratio' criterion to evaluate the potential information generated
by partitioning a sample on a given attribute. This measure is given by dividing the
'information gain' by the 'split information'. The 'split information' denotes the potential
information generated by partitioning a set using some test-attribute. The 'information
gain' measures the information relevant to classification that arises from partitioning the
set. The 'gain-ratio' then represents the proportion of information generated by the split
that appears helpful for classification. Quinlan (1993) writes:

In my experience, the gain ratio criterion is robust and typically gives a
consistently better choice of test than the gain criterion. It even appears
advantageous when all tests are binary, but differ in the proportions of cases
associated with the two outcomes. However, Mingers (1989) compares several
test selection criteria and, while he finds that gain ratio leads to smaller trees,
expresses reservations about its tendency to favor unbalanced splits in which one
subset is much smaller than the others. (24)

Lopez De Mantaras (1991) describes two additional drawbacks of the gain-ratio measure:
(1) it is undefined when there is no 'split information' (i.e., the denominator
is zero), and (2) it may choose test-attributes having a very small amount of 'split
information' instead of choosing one with high 'information gain'.
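
To make the gain-ratio computation concrete, the following is a minimal Python sketch
for a discrete attribute. It is our illustration and not C4.5's own code: the function
names entropy and gain_ratio are hypothetical, entropy is measured in bits, and returning
None when the split information is zero is only one possible way to flag the undefined
case noted above.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total)
                for c in Counter(labels).values())


def gain_ratio(values, labels):
    """Gain ratio of partitioning `labels` by the discrete attribute `values`.

    Returns None when the split information is zero (the undefined case).
    """
    total = len(labels)
    subsets = {}
    for v, y in zip(values, labels):
        subsets.setdefault(v, []).append(y)
    # Information gain: entropy before the split minus the weighted
    # entropy of the subsets after the split.
    remainder = sum(len(s) / total * entropy(s) for s in subsets.values())
    gain = entropy(labels) - remainder
    # Split information: entropy of the partition sizes themselves.
    split_info = -sum(len(s) / total * math.log2(len(s) / total)
                      for s in subsets.values())
    if split_info == 0:
        return None  # assumption: signal "undefined" rather than divide by zero
    return gain / split_info
```

An attribute that takes a single value on every case yields one subset, so its split
information is zero and the sketch returns None rather than a ratio.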

Using the gain-ratio criterion, C4.5 recursively partitions the sample until each
subset in the partition contains cases of a single class, or until only a 'minimum'
number of cases remain for the split. The user can choose this minimum number; we
use the default value of two in this work.

Finally, C4.5 also contains programs that allow users to perform tasks such as
forming production rules and grouping attribute values. Another highlight
of C4.5 is that it allows users to interactively enter a case and receive a class assignment
for it. In this dissertation we use C4.5 solely for building decision trees.


CHAPTER  4 
FEATURE  CONSTRUCTION 


4.1    Decision Trees and Feature Construction

Many  of  the  current  learning  methods  using  decision  trees  as  a  concept  description 
language  represent  the  concept  using  DNF  expressions.  Most  methods  require  a  set  of 
prime  attributes  and  a  sample  of  a  target  concept.  Matheus  (1989)  gives  a  comparative 
analysis  of  several  learning  methods  that  also  incorporate  some  form  of  feature 
construction.  Feature  construction  is  a  technique  for  creating  new  features  which  are 
combinations  of  a  set  of  primitive  attributes.  This  is  a  type  of  representation  change, 
where  each  term,  or  feature,  in  the  concept  description  is  a  function  of  the  prime 
attributes.  In  this  way,  for  example,  we  can  'invent'  a  variable  for  each  Boolean  function 
represented  by  a  conjunction  of  up  to  k  literals.  A  literal  is  simply  a  variable/attribute 
or  its  negation. 
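
As an illustration of 'inventing' one variable per conjunction of up to k literals, the
following Python sketch enumerates such features over n Boolean attributes. The
representation (a feature as a tuple of attribute-index and polarity pairs) and the
function names are our assumptions, chosen only for this example.

```python
from itertools import combinations, product


def conjunctive_features(n, k):
    """Enumerate conjunctions of 1..k literals over attributes 0..n-1.

    Each feature is a tuple of (attribute_index, polarity) pairs, where
    polarity True means the attribute itself and False means its negation.
    """
    features = []
    for size in range(1, k + 1):
        for attrs in combinations(range(n), size):
            for signs in product([True, False], repeat=size):
                features.append(tuple(zip(attrs, signs)))
    return features


def evaluate(feature, example):
    """Value of a conjunctive feature on a Boolean example (sequence of bools)."""
    return all(example[a] == sign for a, sign in feature)
```

With n = 3 and k = 2 the sketch produces 3 x 2 single-literal features and 3 x 4
two-literal features, which already shows how quickly the feature set grows with k.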

Claiming that feature construction is a 'powerful' tool for increasing both the
accuracy and understanding of structure, Breiman et al. (1984) give two reasons for
constructing and using features:

1.  The original data are high dimensional with little usable information in any
    one of the individual coordinates. An attempt is made to "concentrate" the
    information by replacing a large number of the original variables by a
    smaller number of features.

2.  On examination, the structure of the data appears to have certain properties
    that can be more sharply seen through the values of appropriate features.
    (138)

The authors form features using various statistically based procedures such as
linear combinations, Boolean combinations, and ad hoc combinations.

Considering  AI  approaches,  Pagallo  (1990)  shows  that  when  the  tests  at  the 
decision  nodes  are  limited  to  single  attributes,  concepts  with  small  DNF  descriptions  do 
not  always  have  a  concise  decision  tree  representation  because  the  tests  to  determine  if 
an  instance  satisfies  a  term  have  to  be  replicated  in  the  tree.  Also,  given  a  decision  tree 
constructed  using  a  set  of  primitive  attributes,  some  form  of  feature  construction  is 
necessary  in  order  to  examine  key  results  of  Ehrenfeucht  and  Haussler  (1989)  related  to 
finding  consistent  decision  trees  of  minimum  rank. 

Ehrenfeucht and Haussler (1989) present a formal framework for learning decision
trees from random examples. Claiming that their algorithm produces decision trees
having a minimal number of nodes, the authors use features, instead of the primitive
attributes, in their sets of attributes used to build decision trees. A key task of the
authors' algorithm is finding a decision tree having a rank equal to one. Recall from
Chapter 1 that decision trees with a rank of one depict a concise representation of a target
concept. Given a set of n primitive attributes, the authors' approach calls for using a new
attribute-set which contains the $2^n$ features to build a decision tree. In many cases this
may not be practical because of, for example, the computational effort required to form
a large number of features using a fairly large set of primitive attributes. Note that
Ehrenfeucht and Haussler's (1989) method differs from existing approaches for building
decision trees that use only primitive attributes.

The problem of constructing decision trees generally consists of three tasks
(Safavian and Landgrebe 1991): (1) choosing an appropriate tree structure; (2) denoting
the feature subsets for use at each internal node; and (3) choosing a decision rule or
strategy to use at each internal node. Many researchers focus on the third task and base
their strategies on just the primitive attributes. Ehrenfeucht and Haussler (1989) also
focus on the third task using the primitives and every possible feature formed using the
primitives. This research focuses on the second task and explores approaches for creating
feature subsets, using a set of primitive attributes, that are likely to be used as nodes in
a tree.

4.2    Feature-Construction  Algorithms 

Focusing  on  the  problem  of  forming  features,  Matheus  (1989)  and  Pagallo  (1990) 
each  developed  an  approach  to  feature  construction  that  does  not  allow  an  almost 
unlimited  introduction  of  features.  Both  approaches  require  a  binary  decision  tree  to 
perform  feature  construction.  This  requirement  limits  the  number  of  features  that  may 
be  formed  since,  in  this  case,  we  form  features  using  only  the  primitive  attributes  found 
in the input decision tree. Note that in many cases, not all of the attributes available for
building a decision tree are used in the tree. Also for these approaches, features are formed
using attributes in a tree that are chosen using various heuristics that select them based
on their relative positions in the tree.


Each author's method is an iterative process requiring a new decision tree
at interim steps. Hence, these approaches generally call for a large amount of time to
form a finite number of features. Before describing these methods, we first discuss how
we analyze the space and time requirements of feature construction algorithms.

4.2.1    Complexity  Measures 

Analysis  of  the  time  of  execution  and/or  the  storage  required  by  a  program  helps 
pinpoint  parts  of  a  program  whose  efficiency  may  be  improved.  We  analyze  how  these 
requirements  increase  as  the  amount  or  size  of  the  data  increases  using  'asymptotic 
analysis'.  This  type  of  analysis  gives  bounds  on  the  amount  of  storage  and  time  required. 
The amount of storage available is determined by a computer system; however, we
endeavor to find efficient programs that require only a reasonable amount of storage
space.

When  constructing  decision  trees,  tradeoffs  are  usually  made  between  (1) 
computational  efficiency,  (2)  accuracy  of  classification,  and,  (3)  storage  space 
requirements. The average number of layers from the root to the terminal nodes (i.e.,
average depth) reflects the weight given to efficiency. On the other hand, the average
number of internal nodes in each level of the tree (i.e., average breadth) reflects the
relative weight given to classifier accuracy (Safavian and Landgrebe 1991).
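
The two quantities just described can be computed directly from a tree. The Python
sketch below assumes a hypothetical representation in which a decision node is a tuple
(test, left, right) and any non-tuple value is a terminal node; the function names are
ours, and the root is taken to be at level zero.

```python
def leaf_depths(tree, depth=0):
    """Depths of all terminal nodes; the root is at depth 0."""
    if not isinstance(tree, tuple):
        return [depth]
    _, left, right = tree
    return leaf_depths(left, depth + 1) + leaf_depths(right, depth + 1)


def internal_per_level(tree, depth=0, counts=None):
    """Number of internal (decision) nodes found at each level of the tree."""
    counts = {} if counts is None else counts
    if isinstance(tree, tuple):
        counts[depth] = counts.get(depth, 0) + 1
        _, left, right = tree
        internal_per_level(left, depth + 1, counts)
        internal_per_level(right, depth + 1, counts)
    return counts


def average_depth(tree):
    """Average number of layers from the root to the terminal nodes."""
    depths = leaf_depths(tree)
    return sum(depths) / len(depths)


def average_breadth(tree):
    """Average number of internal nodes per level that contains any."""
    counts = internal_per_level(tree)
    return sum(counts.values()) / len(counts) if counts else 0.0
```

For the tree ("x1", ("x2", 0, 1), 1), the terminal nodes lie at depths 2, 2, and 1, and
each of the two internal levels holds one decision node.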

We use a complexity measure to study the amount of time required for learning
algorithms to produce an acceptably accurate concept. The sample complexity of a
learning algorithm indicates the number of examples required by the algorithm, as a
function of the various input parameters. The time complexity of a learning algorithm
reflects the amount of time required by the algorithm to process the examples, as a
function of the input parameters. A formal consideration of these concepts is found in
Natarajan (1991).

Sample complexity and time complexity are not comparable concepts (Natarajan
1991; Blumer et al. 1989). The time complexity is inseparable from a representation, or
naming convention, that the algorithm uses when assigning names to each element in a
class of concepts. The sample complexity, on the other hand, does not consider any
representation for use by the learning algorithm. This, in essence, is why a comparative
analysis between the sample and time complexity of a learning algorithm is inappropriate.
The sample and time complexities represent two 'optimality' conditions for developing
an efficient decision tree and/or feature construction algorithm. Ideally, we prefer that
these measures be polynomially bounded by the size of the input data. We do not
explicitly consider both complexity measures in this research; however, we do discuss
their relationship in our model. In general, other optimality criteria, such as
having a minimum number of nodes in the tree, are typically used for constructing
'efficient' decision trees.

If, for example, we choose as our optimality criterion the number of tests needed
to classify an unknown sample, then the problem of constructing efficient decision trees
is NP-complete (Hyafil and Rivest 1976). Thus, certain 'optimal' learning algorithms are
unlikely to have polynomial time complexity.


4.2.2   CITRE 

Matheus'  (1989)  method  of  feature  construction  combines  the  information 
correlated  among  existing  features  to  form  new  features.  He  develops  a  framework  for 
feature  construction  using  four  aspects:  (1)  need  detection,  (2)  constructor  selection,  (3) 
constructor  generalization,  and  (4)  feature  evaluation.  He  also  uses  the  framework  to 
make  a  comparative  analysis  of  eight  learning  systems  that  use  some  form  of  feature 
construction.  The  author  further  develops  and  tests  his  'CITRE'  learning  system  that 
performs  feature  construction  using  decision  trees. 

CITRE  restricts  the  number  of  new  features  for  consideration  by  using  various 
methods  for  selecting  features.  The  author  develops  four  different  'selection  biases'  for 
use  by  the  algorithm:  (1)  algorithm,  (2)  data-based,  (3)  hypothesis-based,  and,  (4) 
knowledge-based. Algorithm biases occur when the implementation of a feature
construction algorithm is hard-coded to select one type of constructor over another,
for example, a disjunct of literals instead of a conjunct of them. Data-based bias is a
characteristic of a system that uses information in the sample to help guide selection of
attributes to use for forming features. Hypothesis descriptions may also provide a basis
for choosing attributes. By using only the attributes present in the hypothesis description,
irrelevant features are discarded and feature construction is focused since we are using
a relatively small set of relevant attributes. Finally, domain knowledge can serve as a
selection bias by restricting the attributes comprising a feature. For example, knowledge
about  the  learning  problem  might  state  a  preference  for  the  number  of  primitive  attributes 
comprising  each  feature,  or  the  size  of  the  terms  for  the  target  concept. 


In all of Matheus' (1989) experiments, the number of new features generated
without some form of bias was significantly higher. Also, other experimental results
show that using a bias resulted in CPU times that were 50 to 100 times lower. Essentially,
the author shows that, to a certain extent, it is advantageous to limit the number of new
features formed. Additionally, the author reveals another problem due to using a large
number of features: the possibility of overfitting the data.

Matheus (1989) compared his heuristics to other existing heuristics used in feature
construction. In most of his experiments the 'fringe' selection method performed as well
as, if not better than, any of the alternative approaches to feature construction that
use a binary decision tree. We describe a 'fringe' selection method of feature
construction in the following section.

4.2.3    FRINGE  and  Dual  FRINGE 

The  heart  of  the  FRINGE  (Pagallo  1990)  algorithm  for  learning  DNF  concepts  is 
the  combination  of  attributes  along  the  positive  paths  of  the  tree  (Pagallo  1990;  Pagallo 
and  Haussler  1990).  Pagallo  (1990)  also  presents  a  Dual  FRINGE  algorithm  that  forms 
features  for  the  negative  leaves.  The  procedure  forms  the  disjunction  of  the  two  literals 
associated  with  the  last  two  decision  nodes  from  the  root  to  the  negative  leaf.  The  dual 
algorithm  facilitates  learning  conjunctive  normal  form  (CNF)  concepts.  However,  as  the 
author  shows,  small  CNF  concepts  do  not  necessarily  have  small  DNF  representations. 
Additionally,  the  author  shows  that  the  structure  of  a  minimal  decision  tree  for  a  DNF 
function includes the 'replication problem'. The replication problem is an illustration of
how the accuracy and conciseness of the learning system can be affected by the level of
concept-learning difficulty. FRINGE constructs features such that decision trees using its
features do not include the replication problem.

The  number  of  new  features  formed  by  FRINGE  has  undesirable  effects  on  the 
execution  time  of  a  single  iteration  of  the  algorithm  (Pagallo  1990).  At  each  node,  every 
variable  is  tested  in  order  to  determine  the  best  one  for  splitting  the  node  and  creating 
subtrees.  Testing  large  numbers  of  features  increases  the  time  of  one  iteration,  and,  given 
this,  the  total  running  time  of  the  algorithm  is  significantly  impacted. 

Adding  to  the  problem  is  that,  in  one  experiment,  Pagallo  (1990)  reports  that  only 
10%  of  the  total  number  of  features  formed  were  used  by  the  program  that  created  the 
final decision tree. This suggests that a meaningful amount of storage and time was
allocated to features that were not chosen for a split in the tree. Pagallo (1990) gives
three reasons why a feature formed by FRINGE may not be chosen for a split: (1) it
is irrelevant to the target concept, (2) it has become part of a more useful feature, and
(3) there are equivalent ways to represent a concept using a decision tree for a given set
of features.

Pagallo  (1990)  suggests  that  the  computational  efficiency  of  FRINGE  may  be 
improved  in  one  of  two  ways.  The  first  is  by  limiting  the  number  of  new  features 
included  in  the  feature  or  attribute  set.  The  second  is  to  remove,  at  each  iteration, 
features  from  the  feature  set  that  are  not  useful  to  the  learning  task.  This  is  also  known 
as  feature  pruning.    Next  we  describe  how  FRINGE  forms  new  features. 


4.2.3.1    FRINGE  feature  construction 

FRINGE  uses  term  formation  rules  on  the  last  two  attributes  in  the  path  to  each 
positive  leaf  of  level  at  least  two,  counting  the  root  at  level  zero.  For  each  such  leaf, 
FRINGE  examines  the  last  two  decision  nodes  in  the  path  from  the  root  of  the  tree  to  the 
leaf  and  forms  a  feature  that  is  the  conjunction  of  two  literals—one  for  each  decision  node. 
If  the  path  to  the  leaf  proceeds  to  the  right  from  a  decision  node,  then  the  literal 
associated  with  this  node  is  just  the  attribute  of  the  test  in  the  decision  node,  otherwise 
it  is  the  negation  of  this  attribute.  Constructing  features  in  this  manner  for  all  positive 
leaves  of  the  tree  on  the  second  level  or  higher  represents  a  single  iteration  of  the 
program. A decision tree is then reconstructed for the next iteration, and the procedure
stops when no new features are constructed.
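
A single pass of the term-formation rule described above can be sketched in Python as
follows. This is our illustration rather than Pagallo's implementation: the tree
representation (a decision node as a tuple (attribute, left, right), leaves as True for
positive and False for negative) and the function name are assumptions, and the sketch
omits the outer loop that rebuilds the tree with the enlarged attribute set.

```python
def fringe_features(tree):
    """One FRINGE-style pass over a binary decision tree.

    For every positive leaf at level two or deeper (root at level zero),
    form the conjunction of the literals of the last two decision nodes
    on the path. A literal is (attribute, True) when the path goes right
    from the node and (attribute, False), i.e. the negation, otherwise.
    """
    features = set()

    def walk(node, path):
        if isinstance(node, tuple):
            attr, left, right = node
            walk(left, path + [(attr, False)])   # left branch: negated literal
            walk(right, path + [(attr, True)])   # right branch: attribute itself
        elif node is True and len(path) >= 2:
            features.add(tuple(path[-2:]))       # conjunction of two literals

    walk(tree, [])
    return features
```

For the tree ("x1", False, ("x2", False, True)), which tests x1 and then x2, the only
positive leaf lies at level two and the pass yields the single feature x1 AND x2.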

Consider the complete binary tree shown in Figure 4.1 illustrating FRINGE's
output of features for a positive leaf. Essentially we see that FRINGE constructs a new
feature that preserves the truth of the term describing the positive leaf. Figure 4.1 also
shows  outputs  of  two  other  methods—Dual  FRINGE  (Pagallo  1990)  and  DCFringe  (Yang 
et  al.  1991).  Dual  FRINGE  (Pagallo  1990)  creates  a  feature  for  each  negative  leaf  in  a 
decision  tree  of  level  at  least  two.  The  new  features  are  disjunctions  of  two  literals:  if 
the  path  to  the  leaf  is  a  left  path  from  a  decision  node,  then  the  literal  associated  with  this 
node  is  the  variable  in  the  decision  node,  otherwise  it  is  the  negation  of  this  variable. 
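
The Dual FRINGE rule for negative leaves can be sketched in the same spirit. Again the
tree representation (a decision node as a tuple (variable, left, right), leaves as True
or False) and the function name are our assumptions, and only one pass is shown.

```python
def dual_fringe_features(tree):
    """One Dual-FRINGE-style pass over a binary decision tree.

    For every negative leaf at level two or deeper (root at level zero),
    form the disjunction of the literals of the last two decision nodes
    on the path. A literal is (variable, True) when the path goes left
    from the node and (variable, False), i.e. the negation, otherwise.
    """
    features = set()

    def walk(node, path):
        if isinstance(node, tuple):
            var, left, right = node
            walk(left, path + [(var, True)])     # left branch: variable itself
            walk(right, path + [(var, False)])   # right branch: negated literal
        elif node is False and len(path) >= 2:
            features.add(tuple(path[-2:]))       # disjunction of two literals

    walk(tree, [])
    return features
```

For the tree ("x1", ("x2", False, True), True), the negative leaf is reached by two left
branches, and the pass yields the disjunction x1 OR x2, which is false exactly at that
leaf.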

DCFringe (Yang et al. 1991) makes use of certain patterns occurring near the
fringe of a binary tree for constructing features that are also disjunctions of literals.
DCFringe requires that, for each positive leaf, (1) the sibling is a leaf, and (2) the parent's
sibling is a positive leaf. Hence, DCFringe decides which feature to construct by
considering more of the context in which the pattern occurs. Note that in the figure we
assume that the requirements are satisfied when giving the feature.

[Figure: a decision tree with a table listing, for each leaf, the features formed by
FRINGE, Dual FRINGE, and DCFringe]

Figure 4.1    A decision tree

4.3   Time  Complexity  Models 


All of the previous methods of feature construction generate several decision trees
in their iterative process of forming a finite number of features. We propose that forming
features in this way may require an abnormally large amount of time. Also, it is difficult
to develop practical time complexity models for these methods, largely because they use
heuristic approaches. This difficulty led to our development of an approach to feature
construction that is (1) not an iterative procedure, and (2) based on a more 'formal'
technique.

We  also  developed  a  time  complexity  model  for  a  decision  tree  algorithm  that 
incorporates  some  form  of  feature  construction.  Our  model  differs  from  others  found  in 
the  literature  since  we  formally  consider  the  time  required  for  forming  a  finite  number 
of  features,  also  taking  into  account  certain  conditions  necessary  for  reducing  the  rank 
of  a  decision  tree. 

Essentially,  we  find  two  types  of  time  complexity  models  in  the  literature 
associated  with  constructing  decision  trees— probabilistic  models  and  algorithmic  models. 
The following sections describe these two types of models. For completeness we discuss
probabilistic models; however, our work has an algorithmic-model focus.

4.3.1    Probabilistic  Models 

Safavian and Landgrebe (1991) characterize tree design as an optimization
problem, using a Bayes model. It is:

$$\min_{T,F,d} \; P_e(T, F, d) \quad \text{s.t. limited training sample size}$$

where $P_e$ is the overall probability of error, T is a given tree structure, and F and d are
the feature subsets and decision rules for use at the internal nodes, respectively.
Additionally, the authors state that, for a limited sample size, as the number of features
increases the accuracy of the estimates for the class conditional densities may deteriorate.

The previous optimization problem does not consider the time or sample
complexity measures for 'optimal' tree design. A way to include the time component is
to use an evaluation function, $E(n_i)$, for every node $n_i$ (Safavian and Landgrebe 1991).
The function is defined as:

$$E(n_i) = -T(n_i) - W \times e(n_i) + \sum_{j=1}^{C_i} p_{i+j} \times E(n_{i+j})$$

where $T(n_i)$ and $e(n_i)$ represent the computation time and classification error for node $n_i$;
W is a user-defined weighting factor indicating the relative importance of accuracy to the
computation time; $C_i$ is the number of descendent nodes of $n_i$; and $p_{i+j}$ is the probability
of access from node $n_i$ to descendent node j.
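
The recursive structure of this evaluation function can be sketched in Python. The Node
class and its field names are hypothetical, introduced only for illustration, and the
sketch assumes the evaluation takes the form E(n) = -T(n) - W x e(n) plus the
probability-weighted sum of E over the descendent nodes.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class Node:
    time: float    # T(n): computation time at this node
    error: float   # e(n): classification error at this node
    # (access probability, descendent node) pairs; empty for a terminal node
    children: List[Tuple[float, "Node"]] = field(default_factory=list)


def evaluate(node: Node, weight: float) -> float:
    """E(n) = -T(n) - W * e(n) + sum over descendents of p * E(descendent)."""
    value = -node.time - weight * node.error
    for prob, child in node.children:
        value += prob * evaluate(child, weight)
    return value
```

A larger weight W penalizes classification error more heavily relative to computation
time, so the same tree receives a lower score when its nodes are inaccurate.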

4.3.2    Algorithmic  Models 

Creating and adding new variables that are conjunctions of up to k literals can
result in the addition of many variables. Wegener (1987) states that the cardinality of the
set of Boolean functions on n variables is $2^{2^n}$. In most methods of feature construction,
we choose the new features from the $2^n$ variables (i.e., conjuncts of literals); hence our
choices are relatively unlimited. This idea is a principal motivation for our current work.

In Chapter 1 we described a polynomial learning algorithm found in Ehrenfeucht
and Haussler (1989), FIND(S,r), that attempts to produce bounded rank decision trees that
are consistent with a given sample. Because a general objective of ours is reducing the
rank of a decision tree, we first consider bounded rank decision trees before describing
our time complexity model.

4.3.2.1    Bounded  rank  decision  trees 


[Figure: a binary tree annotated with rank = 1, rank = 0, and height = 1]

Figure 4.2   A reduced decision tree with rank = 1.


Figure 4.2 shows a binary tree having the smallest number of nodes of any binary
tree having a rank of 1. In the sequel, we denote such a tree as an r(1) tree. We can
easily verify that a tree in the set of smallest r(1) decision trees is a complete binary tree
having a maximum level of 1, and such a tree has $(2^2 - 1) = 3$ nodes. This is a binary
tree with root labeled $v_i$, left subtree $Q_0$, and right subtree $Q_1$, such that each of the
subtrees, $Q_0$ and $Q_1$, consists of only a root node labeled either 0 or 1 (i.e., a leaf).

A tree in the set of 2nd smallest r(1) decision trees has 5 nodes. Note that we can
easily verify that a binary tree with 5 nodes has two internal nodes. Hence, this is a
binary tree with root labeled $v_i$, left subtree $Q_0$, and right subtree $Q_1$, such that one of the
subtrees, $Q_0$ or $Q_1$, consists of only a root node labeled either 0 or 1 (i.e., a leaf), and the
other subtree is a member of the set of smallest r(1) decision trees. Note that the height
of such a tree is two. Continuing in this manner, a member of the set of 3rd smallest r(1)
decision trees is a binary tree with root labeled $v_i$, left subtree $Q_0$, and right subtree $Q_1$,
such that one of the subtrees, $Q_0$ or $Q_1$, consists of only a root node labeled either 0 or 1
(i.e., a leaf), and the other subtree is a member of the set of 2nd smallest r(1) decision
trees.


[Figure: a collection of binary trees; for all the trees, rank = 1 and height = 3; the
legend marks leaf nodes]

Figure 4.3   3rd smallest decision trees of rank 1.


Figure 4.3 shows a collection of 3rd smallest r(1) decision trees. Note that only
the orientation of the trees differs while the node labels remain unchanged. In general, a
member of the set of kth smallest r(1) decision trees is a binary tree with root labeled $v_i$,
left subtree $Q_0$, and right subtree $Q_1$, such that one of the subtrees, $Q_0$ or $Q_1$, consists of
a root node labeled either 0 or 1 (i.e., a leaf), and the other subtree is a member of the
set of (k - 1)th smallest r(1) decision trees.
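
The recursive description above translates directly into code. The Python sketch below
builds one member of the set of kth smallest r(1) trees and checks its rank, assuming the
usual definition of rank from Ehrenfeucht and Haussler (1989): a leaf has rank 0, and a
node whose subtrees have ranks r0 and r1 has rank max(r0, r1) if they differ and r0 + 1
if they are equal. The tuple representation and the function names are our assumptions.

```python
def rank(tree):
    """Rank of a binary tree given as a leaf label or (label, left, right)."""
    if not isinstance(tree, tuple):
        return 0
    _, left, right = tree
    r0, r1 = rank(left), rank(right)
    return max(r0, r1) if r0 != r1 else r0 + 1


def size(tree):
    """Total number of nodes (internal nodes plus leaves)."""
    if not isinstance(tree, tuple):
        return 1
    _, left, right = tree
    return 1 + size(left) + size(right)


def kth_smallest_r1(k):
    """One member of the kth smallest rank-1 trees (k >= 1).

    Built recursively: a root with a leaf as its left subtree and a
    member of the (k - 1)th smallest rank-1 trees as its right subtree.
    """
    tree = ("v%d" % k, 0, 1)            # smallest r(1) tree: one test, two leaves
    for i in range(k - 1, 0, -1):
        tree = ("v%d" % i, 0, tree)     # add a root whose left subtree is a leaf
    return tree
```

Placing the leaf on the right instead of the left at any step gives the other
orientations of the same kth smallest tree; the rank and node count are unaffected.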

From the previous discussion, we see that, to a certain extent, a subtree (i.e., $Q_0$
or $Q_1$) represents a 'smaller' size of the sample since it is a partition of a larger sample
using some given measure. This result lends itself well to a recursive
implementation for building r(1) trees. An r(1) tree with k internal nodes has a total of
(2k + 1) nodes or labels. Using this method of constructing r(1) decision trees, we may be
able to decrease the rank of a given tree by incorporating a predefined set of useful or
'relevant' features formed using the attributes appearing in the tree. We can consider a set
of useful features, in this context, as a feature-set with the following properties:

(1)  each  feature  corresponds  to  a  subspace  of  lower  dimensionality  than  the 
original  n-dimensional  feature  space 

(2)  the  subspace  can  be  successfully  partitioned,  using  some  measure  or  test, 
where  at  least  one  of  the  partitions  contains  a  predominance  of  examples 
belonging  to  a  single  class 

(3)  there exists at least one r(1) decision tree partitioning the original
n-dimensional feature space, whose internal nodes are all elements of the
feature-set

An example of a useful set of features is a feature-set consisting of features for all
Boolean functions represented by a conjunction of up to k literals. In many applications,
the set of primitive attributes is not a useful feature-set.


4.3.3   Research  Model 

Recall  the  time  complexity  of  a  polynomial  learning  algorithm  for  bounded  rank 
decision  trees  (Ehrenfeucht  and  Haussler  1989): 

For any nonempty sample S of a function on $X_n$ and $r \geq 0$, the time of

FIND(S,r) is $O(|S|(n+1)^{2r})$.
Note that the authors' focus is on learning target functions represented as decision trees
by drawing random examples of them. Consider the term (n + 1). Observe that 'n'
represents  the  number  of  Boolean  variables  or  attributes  used  to  construct  the  tree.  After 
adding  all  of  the  new  features  formed  from  a  set  of  primitives,  if  FIND(S,r)  is  successful, 
the  learning  algorithm  reduces  to  Rivest's  (1987)  algorithm  (i.e.,  a  Decision  List),  for  r 
equal  to  one  (Ehrenfeucht  and  Haussler  1989).  Hence,  the  time  complexity  of  the 
algorithm  is  reduced  to  a  polynomial  in  n  when  the  rank  is  one. 

The  previous  result  does  not  directly  address  the  behavior  of  the  time  complexity 
as  we  add  additional  features  to  a  set  of  primitive  attributes  used  to  build  a  tree.  An 
objective  of  this  research  is  to  investigate  how  the  time  complexity  behaves  as  we 
increase  the  number  of  features  while  also  allowing  the  rank  of  the  tree  to  decrease. 

Now, suppose we only use the primitive attributes to build an initial decision tree.
For a given sample S, consider the factor of the time complexity, $(n+1)^{2r}$, where n
represents the number of primitive attributes for the sample. We want to add additional
features to the set of primitive attributes and build a decision tree such that the new time
complexity is less than or equal to the time complexity that results from using the initial
set of primitives. This requires that the tree-construction algorithm construct a decision
tree of a lower rank, or build the next 'best' tree, if possible.

Suppose we add j new features to the feature-set originally containing the
primitive attributes to construct a decision tree of rank (r - i), where i is an integer
between 0 and r. We seek bounds on j, for a given i, such that, for n > 0, equation (4.1)
holds.

$$(n+1)^{2r} \;\geq\; (n+1+j)^{2(r-i)} \qquad (4.1)$$

Thus, we desire to show a general solution for j. Note that equation (4.1) does
not reflect the computational effort required for constructing the j new features. Let FC(j)
represent this time complexity. Adding this component to (4.1) gives a time complexity
model for using feature construction to build decision trees. Hence, our research model
is the following:

$$(n+1)^{2r} \;\geq\; (n+1+j)^{2(r-i)} + FC(j) \qquad (4.2)$$

Note that for j = 0, we require that FC(j) = 0.

Using the previous result, we find appropriate values for j in the following
manner. Given an r, and using (4.2), we determine the value of j for all values of i
between 0 and r, exclusive. This iterative procedure determines a unique j-value for
each of the (r-1) values of i.
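
The search for j can be sketched numerically. The Python function below is only an
illustration under stated assumptions: it reads the model as requiring
(n+1)^(2r) >= (n+1+j)^(2(r-i)) + FC(j), it takes FC as a caller-supplied function with
FC(0) = 0 (defaulting to zero, which reduces the model to (4.1)), and it assumes the
right-hand side does not decrease as j grows. The names max_new_features and fc are
hypothetical.

```python
def max_new_features(n, r, i, fc=lambda j: 0):
    """Largest j >= 0 with (n+1)**(2r) >= (n+1+j)**(2(r-i)) + fc(j).

    Assumes 0 < i < r, fc(0) == 0, and a right-hand side that is
    nondecreasing in j, so that a bracketing search is valid.
    """
    if not 0 < i < r:
        raise ValueError("i must lie strictly between 0 and r")
    budget = (n + 1) ** (2 * r)

    def ok(j):
        return (n + 1 + j) ** (2 * (r - i)) + fc(j) <= budget

    # Exponential search for an upper bracket, then binary search.
    lo, hi = 0, 1
    while ok(hi):
        lo, hi = hi, hi * 2
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if ok(mid):
            lo = mid
        else:
            hi = mid
    return lo


def bounds_for_all_i(n, r, fc=lambda j: 0):
    """One j-value for each i strictly between 0 and r."""
    return {i: max_new_features(n, r, i, fc) for i in range(1, r)}
```

For example, with n = 4 primitive attributes and r = 3, the sketch returns one bound for
i = 1 and one for i = 2; supplying a nonzero fc can only lower these bounds.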

For our analysis, we separate the complexity of constructing decision trees using
feature construction into two components: the tree-construction component,
$(n+1+j)^{2(r-i)}$, and the feature-construction component, FC(j). Initially, we wish to
obtain a bound on the largest j for a fixed i. If there is an improvement on the order of the
time complexity, the preliminary result established in (4.1) is paramount since $FC(j) \geq 0$.

4.3.3.1    Tree-construction  component 

Considering (4.1), for the smallest value of r (i.e., r = 2) and i = 1, this is a second-order
equation in j. We can easily apply the quadratic formula to derive a solution for
j. This, however, does not apply in the general case for $r \gg 1$.

Another way to solve for j is to use Taylor's Theorem. Let x = (n+1) and
$f(x) = x^{2(r-i)}$. Given n, r, and i, we know f(x) but not f(x+j) (i.e., the first term on the
right hand side of (4.2)). Using the results of Taylor's theorem, we can approximate the value of
f(x+j) using the first few terms of the Taylor series expansion for f(x).

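The Taylor series expansion of a function f(x) states that:

\[ f(x+dx) = f(x) + f'(x)\,dx + \frac{f''(x)}{2!}(dx)^2 + \frac{f'''(x)}{3!}(dx)^3 + \frac{f^{(4)}(x)}{4!}(dx)^4 + \cdots \]

In this illustration, f'(x) is the first derivative of f, f''(x) is the second derivative, etc. We
now perform a Taylor expansion for our function.

Let $f^{(m)}(x)$ represent the mth derivative of the function $f(x) = x^{2(r-i)}$, and let dx = j. Then,
in general, we have the following equality:

\[ f^{(m)}(x) = \frac{(2(r-i))!}{(2(r-i)-m)!}\, x^{2(r-i)-m}, \quad \text{for } (r-i) > 0,\ m = 1, 2, 3, \ldots, 2(r-i) \]

The Taylor series expansion for our function is given by:

\[ f(x+j) = x^{2(r-i)} + 2(r-i)\,x^{2(r-i)-1} j + \frac{(2(r-i))(2(r-i)-1)}{2!}\, x^{2(r-i)-2} j^2 + \cdots \]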

We  improve  the  accuracy  of  the  approximation  by  using  more  terms  from  the  series. 
Also,  we  get  better  approximations  to  f(x+j)  when  j  is  small.  For  this  research,  we  use 
a  second-order  approximation  for  the  function.  Note  that  x,  j,  and  i  are  all  non-negative 
and  r  >  i.  Hence,  all  terms  of  the  series  are  non-negative  numbers,  and  we  get  the 
following result:

\[ (x+j)^{2(r-i)} \ge x^{2(r-i)} + 2(r-i)\,x^{2(r-i)-1} j + \frac{(2(r-i)-1)(2(r-i))\,x^{2(r-i)-2}}{2} j^2 \]

Putting this result in (4.1) gives:

\[ x^{2r} \ge x^{2(r-i)} + \frac{2(r-i)\,x^{2(r-i)}}{x} j + \frac{(2(r-i)-1)(2(r-i))\,x^{2(r-i)}}{2x^2} j^2 \]

From this result, we desire a j such that the following inequality holds:

\[ -\left(x^{2r}\right) + x^{2(r-i)} + \frac{2(r-i)\,x^{2(r-i)}}{x} j + \frac{(2(r-i)-1)(2(r-i))\,x^{2(r-i)}}{2x^2} j^2 \le 0 \]


We  determine  an  upper  bound  on  j  by  using  the  quadratic  formula  to  get  the  zeroes  of 
the  previous  result.    Let  the  coefficients  of  j  be  denoted  by  a,  b,  and  c. 


\[ c = x^{2(r-i)} - x^{2r} \]

\[ b = \frac{2(r-i)\,x^{2(r-i)}}{x} \]

\[ a = \frac{2(r-i)\,(2(r-i)-1)\,x^{2(r-i)}}{2x^2} \]

Then the 'zero' of the polynomial determining a nonnegative upper bound on j is given
by the following equation:

\[ j = \frac{-b + \sqrt{b^2 - 4ac}}{2a} \]

We  illustrate  the  result  with  the  following  example. 

EXAMPLE  4.1 

Let r = 2, i = 1, and n = 5. Then x = 6 and (r-i) = 1, so that a = 1, b = 12,
c = 36 - 1296 = -1260, and

\[ j = \frac{-12 + \sqrt{12^2 - 4(-1260)}}{2} = 30 \]

For this example, j is just less than the total number of alternative new features
we may add (i.e., $2^5 = 32$). Interpreting this result using (4.1) means that, given a set of
5 prime attributes, we can add at most 30 new features, which are combinations of these
attributes. If, after adding the features, we can construct a decision tree with rank = (2 - 1) = 1,
then the order of the time complexity of the algorithm will be no greater than the
order of the time complexity resulting from using the initial n attributes (assuming FC(j)
is negligible).
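
The calculation in Example 4.1 can be sketched in a few lines of Python. This is only an illustration of the quadratic bound derived above; the function name `max_new_features` and its argument checks are assumptions introduced here and are not part of the original analysis.

```python
import math


def max_new_features(n: int, r: int, i: int) -> float:
    """Upper bound on j from the second-order Taylor bound on (4.1).

    Hypothetical helper: n primitive attributes, initial rank r,
    and rank reduction i with 0 < i < r.
    """
    if n <= 0 or not 0 < i < r:
        raise ValueError("require n > 0 and 0 < i < r")
    x = n + 1
    k = r - i
    # Coefficients of a*j^2 + b*j + c <= 0, as defined in the text.
    a = (2 * k) * (2 * k - 1) * x ** (2 * k) / (2 * x ** 2)
    b = 2 * k * x ** (2 * k) / x
    c = x ** (2 * k) - x ** (2 * r)
    # Nonnegative zero of the quadratic.
    return (-b + math.sqrt(b * b - 4 * a * c)) / (2 * a)


print(max_new_features(n=5, r=2, i=1))  # 30.0, matching Example 4.1
```

For (r - i) = 1 the second-order expansion is exact, which is why the bound of 30 satisfies (4.1) with equality: (6 + 30)^2 = 6^4.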

So far, we have not considered the computational effort required for constructing
the new features we added. The second complexity component in (4.2), FC(j), represents
the computational effort for creating a feature-set. This task consists of: (1) searching the
space of possible features, (2) choosing j features, and (3) adding the j features to the set
of initial prime attributes. The following sections describe our approach for
accomplishing these tasks.

4.4    Finding  New  Features 

Recall that there are a total of $2^n$ features. Matheus (1989) gives results showing
that the overall search problem for a 'useful' set of features is intractable in the general
case. This result establishes a need for search procedures that use good heuristics.

Bounds on the number of features constructed by a feature-construction method
depend on whether or not some form of bias is used for limiting the number of features
for consideration by the method. Methods that do not use any bias for constructing
features usually generate a higher number of features and use more CPU time than those
that do (Matheus 1989).

4.4.1    Searching  a  Feature  Space 

In order to 'efficiently' search a feature-space, we focus on the problem of
searching  the  space  of  all  possible  feature-sets  for  a  given  set  of  primitives.  In  other 
words,  for  a  given  n,  r,  and  i,  the  time  complexity  of  the  problem  is  now  expressed  in 
terms  of  j. 

Suppose that for a given sample, we have a decision tree of rank r on n primitive
attributes. Recalling the Taylor series expansion, let $j_{r-1}$ denote the maximum number of
features we can add for constructing a decision tree of rank (r - 1), and let $j_{r-2}$ be the
maximum number of features for constructing a tree of rank (r - 2), etc. Then $j_1$ is the
maximum number of features we can add for constructing a decision tree of rank 1 (i.e.,
an r(1) tree). We now establish the following lemma.

LEMMA  4.1 

Assume x, r, and i are positive integers such that $r > i \ge 1$. Let k = (r - i). For
constant x and r, given k, let $j_k$ denote the maximum nonnegative value of j for the
following inequality to hold:

\[ x^{2k} - \left(x^{2r}\right) + \frac{2k\,x^{2k}}{x} j + \frac{(2k-1)(2k)\,x^{2k}}{2x^2} j^2 \le 0 \tag{4.3} \]

Then $j_k$ is decreasing in increasing k.

Proof 

We want to show that $j \propto 1/k$. To show this we first determine a relationship for
j in terms of k. Next, we show that the first derivative of our relationship with respect
to k (for constant x and r) is less than zero. Given this, it follows that $j \propto 1/k$.

Note that the conditions of Lemma 4.1 suggest the following:

(1) $r > i \ge 1$ and k = (r - i) imply $1 \le k \le r - 1$;

(2) $k \ge 1$ implies $2k - 1 \ge 1$ and $2k \ge 2$; and,

(3) $x^{2k} > 0$.

Each coefficient of j is a strictly increasing function of k. To show this, we must show
that the first derivative of the coefficient with respect to k is greater than zero. Hence,
considering the coefficient of j, we have:

\[ \frac{d}{dk}\left[\frac{2k\,x^{2k}}{x}\right] = \frac{2}{x}\left(2k\,x^{2k}\ln x + x^{2k}\right) > 0 \]

since ln(x) > 0 for x > 1 (here x = n + 1 with n > 0). Now, considering the coefficient of $j^2$:

\[ \frac{d}{dk}\left[\frac{(2k-1)(2k)\,x^{2k}}{2x^2}\right] = \frac{2(2k)(2k-1)\,x^{2k}\ln x + x^{2k}(8k-2)}{2x^2} > 0 \]

Considering the left hand side of the inequality of Lemma 4.1, we use the
quadratic formula to determine a relationship for j. Then we show that the first
derivative of this relationship with respect to k is < 0. If (4.3) were an equality, then the
solution for j using the quadratic formula is given by:

\[ j = \frac{-\dfrac{2k\,x^{2k}}{x} \pm \sqrt{\dfrac{4k^2 x^{4k}}{x^2} - 4\,\dfrac{(2k-1)(2k)\,x^{2k}}{2x^2}\left(x^{2k} - x^{2r}\right)}}{\dfrac{(2k-1)(2k)\,x^{2k}}{x^2}} \]

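Focusing on the term under the radical gives:

\[ \frac{4k^2 x^{4k}}{x^2} - \frac{(2k-1)(4k)\,x^{4k}}{x^2} + \frac{(2k-1)(4k)\,x^{2k} x^{2r}}{x^2} \]

\[ = \frac{4x^{2k}}{x^2}\left(k^2 x^{2k} - (2k-1)(k)\,x^{2k} + (2k-1)(k)\,x^{2r}\right) \]

\[ = \frac{4x^{2k}}{x^2}\left(k^2 x^{2k} - 2k^2 x^{2k} + k\,x^{2k} + (2k-1)(k)\,x^{2r}\right) \]

\[ = \frac{4x^{2k}}{x^2}\left(-k^2 x^{2k} + k\,x^{2k} + (2k-1)(k)\,x^{2r}\right) \]

and hence

\[ j = \frac{-\dfrac{2k\,x^{2k}}{x} \pm \dfrac{2x^{k}}{x}\sqrt{-k^2 x^{2k} + k\,x^{2k} + (2k-1)(k)\,x^{2r}}}{\dfrac{(2k-1)(2k)\,x^{2k}}{x^2}} \]

\[ = \frac{2x\left(-k\,x^{2k} \pm x^{k}\sqrt{-k^2 x^{2k} + k\,x^{2k} + (2k-1)(k)\,x^{2r}}\right)}{(2k-1)(2k)\,x^{2k}} \]

\[ = \frac{x\left(-k\,x^{2k} \pm x^{k}\sqrt{-k^2 x^{2k} + k\,x^{2k} + (2k-1)(k)\,x^{2r}}\right)}{(2k-1)(k)\,x^{2k}} \]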
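Again focusing on the term under the radical gives:

\[ \pm\sqrt{-k^2 x^{2k} + k\,x^{2k} + (2k-1)(k)\,x^{2r}} \]

\[ = \pm\sqrt{x^{2k}\left(-k^2 + k + \frac{(2k-1)(k)\,x^{2r}}{x^{2k}}\right)} \]

\[ = \pm\sqrt{x^{2k}\left(-k^2 + k + (2k-1)(k)\,x^{2(r-k)}\right)} \]

\[ = \pm\, x^{k}\sqrt{k(1-k) + (2k-1)(k)\,x^{2(r-k)}} \]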

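Considering the term under the radical, $k \ge 1$ implies that $k(1-k) \le 0$. This means that,
for "+":

\[ +\,x^{k}\sqrt{k(1-k) + (2k-1)(k)\,x^{2(r-k)}} \le +\,x^{k}\sqrt{(2k-1)(k)\,x^{2(r-k)}} \]

For "-", the opposite holds. Considering the term under the radical on the right hand side
of the inequality, we have:

\[ \sqrt{(2k-1)(k)\,x^{2(r-k)}} = x^{(r-k)}\sqrt{(2k-1)(k)} \]

\[ \therefore\ j \le x\left[\frac{-k + \sqrt{(2k-1)(k)}\,x^{(r-k)}}{(2k-1)(k)}\right] = x\left[\frac{-1 + \sqrt{2 - \frac{1}{k}}\,x^{(r-k)}}{2k-1}\right] \]

Observe that (2 - 1/k) < 2 since $k \ge 1$. This means that the square root of (2 - 1/k) is less
than the square root of 2. Given this result, we now show a relationship for j in terms
of k, for constant x and r.

\[ j < x\left[\frac{-1 + \sqrt{2}\,x^{(r-k)}}{2k-1}\right] < x\,\frac{\sqrt{2}\,x^{(r-k)}}{2k-1} \]

\[ \therefore\ j < \sqrt{2}\,(x)\,x^{r}\left((2k-1)\,x^{k}\right)^{-1} \tag{4.4} \]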


Hence,  we  claim  that  any  positive  j  that  satisfies  (4.3)  must  also  satisfy  (4.4). 

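We now want to show that the first derivative of our relationship with respect to
k is < 0.

\[ \frac{d}{dk}\left[\sqrt{2}\,(x)\,x^{r}\left((2k-1)\,x^{k}\right)^{-1}\right] = -\sqrt{2}\,(x)\,x^{r}\,\frac{(2k-1)\,x^{k}\ln x + 2x^{k}}{\left((2k-1)\,x^{k}\right)^{2}} \]

\[ = -\sqrt{2}\,(x)\,x^{r}\left[\frac{\ln x}{(2k-1)\,x^{k}} + \frac{2}{(2k-1)^2 x^{k}}\right] \]

Given positive integers x and r, we have:

\[ \sqrt{2}\,(x)\,x^{r} > 0 \]

\[ (2k-1)\,x^{k} > 0 \]

\[ (2k-1)^2 x^{k} > 0 \]

\[ \frac{\ln x}{(2k-1)\,x^{k}} > 0 \]

\[ \frac{2}{(2k-1)^2 x^{k}} > 0 \]

since $(2k - 1) \ge 1$ and ln(x) > 0. Thus, each factor in the previous equation is a positive
number. Therefore, the first derivative of our relationship with respect to k is a negative
number, or,

\[ \frac{d}{dk}\left[\sqrt{2}\,(x)\,x^{r}\left((2k-1)\,x^{k}\right)^{-1}\right] < 0 \]

and this is what we wanted to show.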
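
The behaviour described by Lemma 4.1 can also be checked numerically. The sketch below solves (4.3) as an equality for each k and compares the root with the right-hand side of (4.4). The values x = 6 and r = 4 are illustrative choices, and the helper names `j_max` and `bound_4_4` are assumptions introduced here, not part of the original proof.

```python
import math


def j_max(x: int, r: int, k: int) -> float:
    """Nonnegative root of (4.3) treated as an equality."""
    a = (2 * k - 1) * (2 * k) * x ** (2 * k) / (2 * x ** 2)
    b = 2 * k * x ** (2 * k) / x
    c = x ** (2 * k) - x ** (2 * r)
    return (-b + math.sqrt(b * b - 4 * a * c)) / (2 * a)


def bound_4_4(x: int, r: int, k: int) -> float:
    """Right-hand side of (4.4): sqrt(2) * x * x^r / ((2k - 1) * x^k)."""
    return math.sqrt(2) * x * x ** r / ((2 * k - 1) * x ** k)


x, r = 6, 4  # illustrative values only
for k in range(1, r):
    # Each row: k, the root of (4.3), and the bound from (4.4).
    print(k, round(j_max(x, r, k), 2), round(bound_4_4(x, r, k), 2))
```

For these values the roots shrink as k grows and each root stays below the corresponding bound, which is the pattern the lemma describes.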

Now, construct up to $j_1$ new features and add them to the set of variables.
Determine whether we can now construct an r(1) decision tree. If we can construct such
a tree then we are done. If not, then determine whether we can construct an r(2) decision
tree using the first $j_2$ features previously constructed. If we can, then we are done.
Otherwise, continue in a like fashion.
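
The iterative procedure just described can be outlined as follows. This is a rough sketch under stated assumptions: `can_build_tree` is a hypothetical stand-in for the tree-construction algorithm (the text does not specify its interface), and `feature_budget` is assumed to hold the bounds $j_1, j_2, \ldots$ indexed by target rank.

```python
from typing import Callable, Dict, List, Optional, Tuple


def lowest_rank_tree(
    r: int,
    feature_budget: Dict[int, int],
    features: List[str],
    can_build_tree: Callable[[int, List[str]], bool],
) -> Optional[Tuple[int, List[str]]]:
    """Try target ranks 1, 2, ..., r-1 in turn and stop at the first success.

    feature_budget[rank] is the maximum number of constructed features
    allowed for that rank (j_1 for rank 1, j_2 for rank 2, and so on).
    can_build_tree is a hypothetical callback reporting whether a tree
    of the given rank can be built from the given features.
    """
    for rank in range(1, r):
        # Use only the first j_rank of the features already constructed.
        subset = features[: feature_budget[rank]]
        if can_build_tree(rank, subset):
            return rank, subset
    return None  # no tree of rank lower than r was found


# Toy usage with a made-up oracle: rank 2 succeeds once "x1x2" is available.
oracle = lambda rank, fs: rank >= 2 and "x1x2" in fs
print(lowest_rank_tree(3, {1: 3, 2: 2}, ["x1x2", "x1x3", "x2x3"], oracle))
```

Because the budgets shrink as the target rank grows, each later attempt reuses a prefix of the features constructed for the first attempt rather than constructing new ones.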

A key premise in our research-model hypothesis is being able to construct
decision trees of smaller ranks. If we cannot construct such trees after creating a limited
number of features, then the features may not be useful for our purposes (ideally we want
to construct r(1) trees). Using this broad description of useful features, we require a
framework for characterizing the 'usefulness' of features. In the sections that follow, we
discuss several concepts helpful in determining feature subsets to use when constructing
decision trees.

4.5   Feature-Representation  Models 

Feature  construction  is  a  form  of  representation  change  where  each  feature,  or 
term,  in  the  concept  description  is  a  conjunct  of  the  prime  attributes.  Korf  (1980) 
presents  a  model  for  searching  in  a  'space'  of  representations  where  information  structure 
and  information  quantity  are  the  two  dimensions  in  this  representation  space.  Changes 
along  these  dimensions  are  characterized  by  (1)  isomorphisms  or  changes  of  information 
structure,  and  (2)  homomorphisms  or  changes  to  information  quantity. 

In feature construction, the representation space we are searching consists of the
$2^n$ possible features, where each feature represents a point in the space. For this
discussion, we explore the areas considered by Korf's heuristic for searching a
problem-representation space (Korf, 1980). These are: (1) characterizing the space of possible
features, (2) characterizing the operators of this space, (3) evaluating different
representations with respect to problem solving efficiency, and (4) effectively searching
the space of features to find 'useful' features for problems. The author primarily focuses
on the first two areas; we study areas (3) and (4).


In order to evaluate different representations with respect to problem solving
efficiency, we first identify representation models found in the literature and describe the
characteristics of each. The models we discuss include OCCAM'S RAZOR, MDLP, and
Boolean Formulae. Recall that the time complexity of a learning algorithm incorporates
a naming convention for assigning names to elements in a class of concepts. Each of
these models represents a reasonable naming convention that we may use since they offer
alternative criteria for consideration when assigning names to elements. The following
sections describe these models and also discuss representation classes and computation
models for Boolean Formulae, a general feature-representation model used in our work.

4.5.1    OCCAM'S  RAZOR 

OCCAM'S  RAZOR  is  based  on  William  of  Occam's  principle  of  parsimony.  This 
principle  says  that,  all  other  things  being  equal,  if  there  are  two  explanations  for  the  data, 
then  choose  the  simpler  explanation  of  the  two.  In  machine  learning,  using  this  principle 
amounts to discovering the simplest hypothesis that explains the data. To use an Occam
approach  for  determining  useful  features,  assume  that  there  is  some  fixed  encoding 
scheme  for  representing  the  hypothesis  and  the  examples,  or  observations,  as  well.  Note 
that  using  this  scheme  also  allows  us  to  encode  all  features  for  the  data.  Let  the 
complexity  of  a  feature  be  defined  as  the  number  of  bits  needed  to  encode  the  feature  in 
the  given  representation.  If  we  have  two  or  more  features  describing  a  given  subset  of 
the  data,  then  an  Occam  approach  suggests  choosing  the  feature  with  the  minimum 
complexity. To use this approach, knowing all possible features describing a given subset
of the data helps determine if we have a feature of minimum complexity. This is not
always practical since, for example, finding a minimum length DNF expression
consistent with a sample of a Boolean function is an NP-hard problem. Hence, we do not
consider using an Occam approach for determining useful features.

4.5.2    MDLP 

A  second  idea  of  interest  is  known  as  the  Minimum  Description  Length  Principle, 
or  MDLP  (Rissanen  1986;  Quinlan  and  Rivest  1989).  MDLP  says  that,  given  a  set  of 
data,  the  best  'theory'  to  infer  from  the  data  is  the  one  which  minimizes  the  sum  of  (1) 
the  length  of  the  theory,  and,  (2)  the  length  of  the  encoded  data  using  the  theory  as  a 
predictor  for  the  data.  Applying  MDLP  to  the  construction  of  decision  trees,  we  would 
construct  (1)  the  smallest  decision  tree  which  perfectly  predicts  the  class,  or,  (2)  decision 
trees  with  the  smallest  possible  error  rate  in  classifying  unseen  objects  (Quinlan  and 
Rivest  1989). 

A natural extension of this idea for determining useful features is to define a set
of  useful  features  as  the  feature-set  used  to  construct  one  of  the  two  types  of  decision 
trees  previously  mentioned  where  a  majority  of  the  features  are  used  in  the  tree.  This 
implies  that  the  feature-set  also  aids  in  reducing  the  rank  of  a  tree.  For  our  work,  the 
'theory'  or  concept  we  want  to  infer  is  a  Boolean  function.  In  the  section  that  follows, 
we  examine  ways  to  describe  the  complexity  of  these  functions  using  an  MDLP  approach. 


4.5.3    Boolean  Formulae 

Wegener  (1987)  gives  a  complete  consideration  of  the  complexity  of  Boolean 
functions,  including  upper  and  lower  bounds  for  describing  the  complexity  of  certain 
problems.  Our  work  focuses  on  finding  efficient  procedures  that  construct  useful  features 
which are conjuncts of up to k literals. Additionally, the running time of these procedures
has to be measured in terms of the size of their input.

For the concept-class of Boolean functions, F, a time-complexity analysis of the
learning algorithms for F shows that learnability may depend on the representation chosen
for assigning names to each function $f \in F$ (Natarajan 1991). The choice of a
representation may depend on the 'costs' of using it (Wegener 1987); however, there are
other criteria that one can use for choosing a representation. In this work we use the cost
of representing a Boolean function determined by the following definitions.

DEFINITION  4.1 

i) A monomial, m, is either the constant function 1 or a conjunction of up to k
literals. The cost of m is equal to the number of literals of m.

ii) A Boolean function $f \in F$ is a disjunction of monomials. The cost of f may be
given by one of two methods. The first method defines the cost of f to be equal
to the sum of the costs of all monomials summed up by f. The second method
defines the cost of f as the number of monomials summed up by f.

These measures reflect the cost of representing f. Before considering how to
obtain cheaper representations for f, we first discuss ways to represent and compute
Boolean functions.
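
The two cost measures of Definition 4.1 are simple to state in code. In the minimal sketch below, a monomial is modelled as a set of literal names and a Boolean function as a list of monomials. This encoding, the `~` prefix for negation, and the function names are assumptions made for illustration only.

```python
from typing import FrozenSet, List

# A monomial is a set of literals such as {"x1", "~x2"}; the empty set
# stands for the constant function 1 (illustrative encoding only).
Monomial = FrozenSet[str]


def monomial_cost(m: Monomial) -> int:
    """Cost of a monomial: its number of literals."""
    return len(m)


def cost_by_literals(f: List[Monomial]) -> int:
    """First method: sum of the costs of all monomials of f."""
    return sum(monomial_cost(m) for m in f)


def cost_by_monomials(f: List[Monomial]) -> int:
    """Second method: number of monomials of f."""
    return len(f)


f = [frozenset({"x1", "x2"}), frozenset({"~x1", "x3", "x4"}), frozenset({"x5"})]
print(cost_by_literals(f), cost_by_monomials(f))  # 6 3
```

The two measures can rank representations differently: a function with few long monomials is cheap under the second method but may be expensive under the first.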



4.5.3.1  Representation  classes 

Incorporating the results of the previous section in our model amounts to
redefining the learning task. Instead of trying to learn the concepts themselves, we
concentrate on learning a class of representations of Boolean concepts.

DEFINITION  [representation  class,  Kearns  et  al.  (1987)] 

A class of representations of concepts is a set $A = \bigcup_{n \ge 1} A_n$ where for each
n, $A_n$ is a subset of all possible formulae of n variables. For example, for each
constant k, $k\text{DNF} = \bigcup_{n \ge 1} (k\text{DNF over } n \text{ variables})$ is a class of representations,
where kDNF denotes disjunctive normal form in which each disjunct consists of
at most k literals. (286)

Given a class of representations, A, for each $f \in A$, let size(f) represent the fewest
number of symbols needed to write the representation for f in A.

We describe a Boolean function using a table, $x \to f(x)$. The length of such a
table is $2^n$ where each entry is a conjunct of at most k literals. Also, it is sufficient to
specify $f^{-1}(1)$ or $f^{-1}(0)$ (where f(x) = 1 or f(x) = 0), if f is an element of the set of Boolean
functions (Wegener 1987). We now give several results on the laws of computation using
Boolean functions.

4.5.3.2  Computation  models 

To compute the value of a function f, we take the disjunction of the monomials of
f. The DNF of f is an illustration of a polynomial for f. Hence, we need a way to
describe cheaper polynomials.

DEFINITION 4.2

If p(x) = f(x) for $f \in F$ and $x \in f^{-1}(\{0,1\})$, then p is a polynomial for f.


DEFINITION 4.3

If a polynomial, p, determines $f \in F$ and no other polynomial for f has a smaller
cost than p, then p is a minimal polynomial determining f.

We find the cost of a polynomial using the cost measures given previously.

Our goal is to find useful monomials for f, and not a minimal polynomial for f.
The previous discussion suggests that the monomials in a minimal polynomial for f
determine the features in a minimal feature-set for a decision tree representing f. Hence,
we desire a way to identify the types of monomials that determine minimal feature-sets.

Wegener (1987) characterizes certain classes of monomials as implicants and
prime implicants. The author gives results showing that minimal polynomials consist only
of prime implicants. The author also gives the conditions necessary for obtaining cheaper
polynomials and reviews an algorithm, namely the Quine and McCluskey algorithm,
which computes the set of all prime implicants of f with length k.

DEFINITION [implicants and prime implicants, Wegener (1987)]

Let $p = m_1 \vee \cdots \vee m_k$ be a polynomial for a Boolean function f. $m_j(a) = 1$
implies p(a) = 1 and $f(a) \in \{1\}$. A monomial m is an implicant of f if
$m^{-1}(1) \subseteq f^{-1}(\{1\})$. I(f) is the set of all implicants of f.

An implicant $m \in I(f)$ is called a prime implicant if no proper submonomial
of m is an implicant of f. PI(f) is the set of all prime implicants of f. (24)
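
The definition above can be illustrated with a brute-force check over all assignments. This is a sketch for small n only, not the Quine and McCluskey algorithm; the dictionary encoding of a monomial and the function names are assumptions made here for illustration.

```python
from itertools import product
from typing import Callable, Dict, Sequence

# A monomial maps a variable index to the value it requires
# (1 for a plain literal, 0 for a negated literal). Illustrative encoding.
Monomial = Dict[int, int]
BoolFn = Callable[[Sequence[int]], int]


def satisfies(m: Monomial, a: Sequence[int]) -> bool:
    """True when m(a) = 1."""
    return all(a[v] == val for v, val in m.items())


def is_implicant(m: Monomial, f: BoolFn, n: int) -> bool:
    """m is an implicant of f if every a with m(a) = 1 also has f(a) = 1."""
    return all(f(a) == 1 for a in product((0, 1), repeat=n) if satisfies(m, a))


def is_prime_implicant(m: Monomial, f: BoolFn, n: int) -> bool:
    """An implicant is prime if no proper submonomial is an implicant.

    Dropping one literal at a time is enough: adding literals to an
    implicant always yields another implicant.
    """
    if not is_implicant(m, f, n):
        return False
    for v in m:
        sub = {u: val for u, val in m.items() if u != v}
        if is_implicant(sub, f, n):
            return False
    return True


def f(a: Sequence[int]) -> int:
    """Example function f = (x0 AND x1) OR x2."""
    return int((a[0] and a[1]) or a[2])


print(is_prime_implicant({0: 1, 1: 1}, f, 3))        # True
print(is_prime_implicant({0: 1, 1: 1, 2: 1}, f, 3))  # False: an implicant, but not prime
```

The check enumerates all $2^n$ assignments, so it is only practical for a handful of variables.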

It appears that implicants possess two of the properties associated with a usable
set of features previously described: (1) they generally correspond to a subspace of lower
dimensionality than the original n-dimensional feature space, and (2) this subspace may
be partitioned such that one of the partitions corresponds to $f^{-1}(\{1\})$ or $f^{-1}(\{0\})$. Thus,
implicants may be one way to characterize the elements of a usable feature-set.

4.6   Feature-Construction  Models 

In  this  section  we  discuss  two  general  methods  for  feature  construction  before 
presenting our procedure. These methods are an exhaustive approach and binary tree
construction.  After  reviewing  these  methods,  we  show  that  they  are  computationally  too 
expensive  for  further  consideration. 

4.6.1    Exhaustive  Approach 

The first method of feature construction we consider simply amounts to
constructing all subsets from the set of primitive attributes. The cardinality of the
feature-set is equal to $2^n$ for n primitives. This exhaustive approach for creating a feature-set
requires exponential computational cost in terms of n.

For example, given $V_n$, consider a decision tree for a given sample S of rank r
over $V_n$. Now construct the $2^n$ subsets of $V_n$ where each subset is a conjunct of at most
k literals, k = n, and add them to the feature-set. Each feature in the set of features may
be considered as a monotone monomial. A monotone monomial is a monomial that
doesn't contain any negated literals. For simplicity, let the cost of feature construction
be equal to the number of constructed features. For FC(j) = j = $2^n$, equation (4.2)
becomes:

\[ (n+1)^{2r} \ge \left(n+1+(2^n-n)\right)^{2(r-i)} + 2^n \]


Note that the $2^n$ features include the primitive attributes, or n features of size one; hence,
we only add $(2^n - n)$ features to the set of primitives.

Since $2^n$ is positive, we derive the following result:

\[ (n+1)^{2r} \ge (2^n+1)^{2(r-i)} \]

Thus, initially, we want to know the largest value of i that satisfies the inequality. This
represents the reduction in rank, or 'savings', from constructing a more 'concise'
description of the target concept. The following derivation essentially shows that we
cannot realize any 'savings' from using an exponential number of features (i.e., any value
of i between 1 and (r-1) inclusively). Hence, this approach represents a computationally
expensive way for forming features.

Given n, S, r, and j, the previous inequality becomes:

\[ (n+1)^{2r} \ge \frac{(2^n+1)^{2r}}{(2^n+1)^{2i}} \]

or,

\[ (2^n+1)^{2i} \ge \frac{(2^n+1)^{2r}}{(n+1)^{2r}} \]

and taking the natural logarithm of both sides gives:

\[ 2i\ln(2^n+1) \ge 2r\left[\ln(2^n+1) - \ln(n+1)\right] \]

and solving for i gives:

\[ i \ge r\left[1 - \frac{\ln(n+1)}{\ln(2^n+1)}\right] \]

Since i is at most (r-1), we derive the following:

\[ \frac{r\ln(n+1)}{\ln(2^n+1)} \ge 1 \]

We now consider the previous result for increasing n. Using l'Hôpital's rule to find the
limit of the ratio of the logarithms gives:

\[ \lim_{n\to\infty} \frac{\ln(n+1)}{\ln(2^n+1)} = \lim_{n\to\infty} \frac{\dfrac{1}{n+1}}{\dfrac{2^n\ln 2}{2^n+1}} = \lim_{n\to\infty} \frac{2^n+1}{(n+1)\,2^n\ln 2} \]

Multiplying by $2^{-n}/2^{-n}$ gives:

\[ \lim_{n\to\infty} \frac{1+2^{-n}}{(n+1)\ln 2} = 0 \]

and since 0 < 1, no value of i between 0 and r satisfies our criteria for sufficiently large n.
Hence, the $2^n$ features, which require exponential feature-construction time and
feature-storage space, do not give an improvement in the standard time complexity; they
require an abnormally large amount of resources for processing them.
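
The limiting behaviour of this ratio can be observed numerically. The short sketch below evaluates $r\ln(n+1)/\ln(2^n+1)$ for a fixed rank and growing n. The function name `rank_ratio` and the sample values of n and r are assumptions chosen for illustration.

```python
import math


def rank_ratio(n: int, r: int) -> float:
    """r * ln(n + 1) / ln(2^n + 1); the derivation above needs this to be >= 1."""
    return r * math.log(n + 1) / math.log(2 ** n + 1)


for n in (5, 6, 10, 20, 50):
    # For fixed r the ratio shrinks toward 0 as n grows.
    print(n, round(rank_ratio(n, r=2), 3))
```

For a fixed r the condition can still hold when n is very small, but the ratio eventually falls below 1 and keeps decreasing, which is the behaviour the limit argument relies on.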

Modifications  to  this  approach  include  using  some  form  of  bias  to  limit  the 
number  of  subsets  to  consider.  Recall  that  CITRE  is  a  feature  construction  algorithm  that 
uses one of four types of bias when constructing features.


4.6.2   Binary  Tree  Construction 

Recall that a binary tree represents a function evaluated by tracing a path from the
root to a leaf (i.e., a terminal path), according to the following rule: "If the variable in
the current internal node is set to 0 then proceed to the left-subtree, else proceed to the
right-subtree." Hence, the inverse function, $f^{-1}$, is: if we are at the left subtree, return the
label of the parent node negated, else return the label of the parent.

Using  this  rule  iteratively,  we  can  construct  features  that  are  conjuncts  of  at  most 
k  literals,  where  k  is  the  number  of  internal  nodes  on  a  terminal  path.  For  this  type  of 
feature  construction,  the  features  may  not  be  monotone  monomials.  Thus,  now  our 
feature-sets  may  contain  attributes/features  and  their  negations.  This  increases  the  number 
of  alternative  features  to  choose  from,  which  in  turn  increases  the  amount  of  time  we 
need  to  consider  the  features  formed  from  a  set  of  primitives  which  are  not  necessarily 
monotone  monomials. 

As  an  example  of  binary  tree  construction,  suppose  we  have  a  binary  tree 
consisting  of  a  root  node  and  two  leaf  nodes.  Since  we  have  one  label  for  the  only 
internal  node  on  a  path,  we  can  only  construct  a  feature  that  is  a  conjunct  of  at  most  1 
literal.  Now,  consider  a  complete  binary  tree  having  a  height  of  two.  Since  there  are  at 
most  two  internal  nodes  on  each  path,  we  can  construct  features  which  are  conjunctions 
of  at  most  two  literals  for  each  terminal  path.  This  is  similar  to  the  mechanism  of  feature 
construction  used  by  the  FRINGE  algorithm. 
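
The path-based construction described above can be sketched as follows. This is an illustration of reading one conjunct off each terminal path, not an implementation of FRINGE; the nested-tuple tree encoding, the `~` prefix for negation, and the function name `path_features` are assumptions introduced here.

```python
from typing import List, Tuple, Union

# A tree is either a leaf label (0 or 1) or a tuple (variable, left, right).
# Illustrative encoding only.
Tree = Union[int, Tuple[str, "Tree", "Tree"]]
PathFeature = Tuple[Tuple[str, ...], int]


def path_features(tree: Tree, prefix: Tuple[str, ...] = ()) -> List[PathFeature]:
    """Return (conjunct of literals, leaf label) for every terminal path."""
    if isinstance(tree, int):
        return [(prefix, tree)]
    var, left, right = tree
    # Left branch: the variable is set to 0, so the literal is negated.
    # Right branch: the variable is set to 1, so the literal is plain.
    return (path_features(left, prefix + ("~" + var,))
            + path_features(right, prefix + (var,)))


# A complete binary tree of height two: each conjunct has two literals.
tree = ("x1", ("x2", 0, 1), ("x3", 0, 1))
for conjunct, label in path_features(tree):
    print(" & ".join(conjunct), "->", label)
```

Each conjunct has as many literals as there are internal nodes on its terminal path, so the height-two tree yields conjuncts of two literals, some of them negated.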

None of the previous methods of feature construction guarantees the construction
of minimal monomials which aid in producing feature-sets of minimal cardinality.
Additionally, we have no way to determine whether the features constructed using these
methods are indeed useful features for decreasing the size of a target concept or size(f).
For these reasons, we desire a procedure for creating a feature-set whose use does not
result in an increase of size(f) when monotone monomials are used for features.
Furthermore, the procedure must preserve the basic structure of the subsets represented
by the internal nodes of a decision tree.

Increasing    the    size    of   a    concept    amounts    to    creating    a    more    complex 

representation  for  the  target  concept,  or,  for  example,    building  a  tree  having  a  higher 

rank  than  the  rank  of  some  other  equivalent  tree.    In  the  following  sections,  we  present 

a  procedure  that  forms  a  finite  number  of  features  using  the  'dual'  of  a  decision  tree. 

4.7   Dual  Trees 

We  developed  a  procedure  to  construct  a  finite  number  of  features  in  a  reasonable 
amount  of  time  using  principles  from  the  theories  of  sets  and  categories,  and,  automata 
theory  or  the  theory  of  finite  state  machines.  A  complete  discourse  of  automata  theory 
and  tree  automata  may  be  compiled  using  Beckman  (1980),  Berstel  (1987),  Alagar 
(1989),  Adamek  et  al.  (1990),  and  Leeuwen  (1990). 

In  Chapter  1  we  described  a  procedure,  DUALTREE,  which  forms  a  finite  number 
of features using a given binary decision tree. Based on Beckman's (1980) method for
finding 'equivalent' automatons having a minimal number of states, DUALTREE feature
construction begins by first forming the 'dual' of the input decision tree. DUALTREE
essentially  consists  of  the  following  steps: 



1.  construct  the  dual  of  the  input  binary  decision  tree; 

2.  perform  DUALTREE  feature  construction  on  the  dual  tree; 

3.  construct  the  dual  of  the  dual; 

4.  perform  DUALTREE  feature  construction  on  the  dual  of  the  dual;  and, 

5.  produce  a  finite  list  of  useful  features. 

Note that steps 3 and 4 are primarily the same as steps 1 and 2.

The following sections illustrate our procedure as we did in Chapter 1; however,
this time we use a more generalized approach. We describe DUALTREE feature
construction and we conclude by verifying our procedure according to certain parameters.

4.7.1    Properties  of  Dual  Trees 

Consider  making  the  following  assignments.    Let  the  subsets  determined  by  the 

internal  nodes  along  a  terminal  path  of  a  decision  tree  represent  the  objects  for  a  category. 

We  characterize  the  dual  of  a  tree  as  follows: 

DEFINITION    4.4 

If Q is any binary decision tree, let Q* be the tree obtained by (1) letting the root
node(s) of Q* be the terminal node(s) of Q, and letting the terminal node(s) of Q* be the
root node of Q, and (2) reversing the directions of the edges in each terminal path of Q
(a path from a root to a leaf in Q). We define Q* to be the dual of Q.

The dual tree, Q*, may be nondeterministic (i.e., one cannot specify which root node to
start on with full accuracy), possibly not a binary tree, and, from its construction, it
follows that (Q*)* = Q.

Recall also from Chapter 1 that references for several of the nodes in the dual tree
may change. The root node(s) of Q become(s) terminal node(s) for Q*, and the terminal
node(s) of Q become(s) root node(s) for Q*. Consider the decision tree, Q, from Chapter
1, and its dual decision tree, Q*, which is again pictured here in Figure 4.4.

[Figure 4.4 The re-oriented dual tree]

Following is a rule for traversing the dual tree: For the ith terminal path determined by a
root node and a leaf, $m_i^*$ is a conjunct of literals determined by tracing the terminal path as
follows: starting with the root node, if the outgoing edge has a value of 1 then return the
value of the label pointed to by the edge, else return its negation. Of course the leaf is
the sole terminal node for all terminal paths of the dual tree shown in the figure.

Using this bottom-up approach for constructing the terms of the function
represented by the dual tree, we produce 'equivalent' terms for the terms created by
traversing the respective paths in the initial decision tree. The only difference is that
the primitive attributes comprising the feature are listed in 'reverse' order. Hence, we
regard m_i* as the dual of m_i in Q.

Many of the concepts and properties for binary trees also apply to the dual tree;
however, some of them do not. For example, the height of the dual tree Q* is the same
as that of Q, and a dual tree with N nodes has N-1 edges. However, it is not appropriate
to speak of a full or complete dual tree without modifying our definition of these terms.

4.8   DUALTREE  Feature  Construction 

Recall that (Q*)* = Q. This says that we produce an equivalent tree by taking the
dual of the dual. Our procedure uses this result to create useful feature-sets. In this
section we present our method for constructing useful feature-sets having a minimal
number of features. Focusing on the problem of learning the minimal monomials of a
representation class for a target concept, DUALTREE forms features that represent
monotone monomials of a boolean function that also aid in producing concise descriptions
for the function. The heart of our procedure is an adaptation of a well known method for
showing that any non-deterministic finite automaton can be replaced by a deterministic
machine that recognizes precisely the same set of tapes: subset construction (Leeuwen
1990; Beckman 1980; Hopcroft and Ullman 1979; and Nelson 1968). We refer to it as
feature construction.


4.8.1    Feature  Construction 

Before  describing  how  to  apply  feature  construction  to  non-deterministic  trees,  we 

set  forth  the  following  definitions  and  assignments. 

DEFINITION    4.4 

A  fundamental  monomial  is   a  monomial   that  contains   no   more   than   one 
occurrence  of  a  variable  or  its  negation. 

DEFINITION   4.5 

Let m_1 and m_2 be two monomials. m_1 subsumes m_2 iff every literal in m_2 is also
in m_1.
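
The two definitions above can be checked mechanically. The following C++ sketch is only an illustration: the integer encoding of literals (+k for attribute x_k, -k for its negation) and the names Monomial, isFundamental, and subsumes are assumptions made here, not part of DUALTREE.

```cpp
#include <cassert>
#include <cstdlib>
#include <set>
#include <vector>

// Assumed encoding: literal +k stands for attribute x_k, -k for its negation.
using Monomial = std::vector<int>;

// A monomial is fundamental if no variable occurs more than once, counting
// a variable and its negation as the same variable.
bool isFundamental(const Monomial& m) {
    std::set<int> seen;
    for (int lit : m) {
        assert(lit != 0);
        if (!seen.insert(std::abs(lit)).second) return false;
    }
    return true;
}

// m1 subsumes m2 iff every literal of m2 also appears in m1
// (the direction used in the definition in the text).
bool subsumes(const Monomial& m1, const Monomial& m2) {
    std::set<int> lits(m1.begin(), m1.end());
    for (int lit : m2)
        if (lits.count(lit) == 0) return false;
    return true;
}
```

For example, with this encoding the monomial x_1 & ~x_2 & x_3 is fundamental and subsumes x_1 & x_3, while x_1 & ~x_1 is not fundamental.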

Let  us  now  consider  applying  feature  construction  to  a  non-deterministic  tree,  Q, 

representing  a  boolean  function.   We  begin  by  making  the  following  assignments: 

(1)  Let M be the set of all monomials over the n boolean attributes appearing in
the tree.

(2)  Let R be the set of all monomials that are labels for the root node(s) of Q,
i.e. R = { r | r ∈ M, r is the label for a root node of Q }.

(3)  Let L be the set of all monomials which contain a proper submonomial
such that the submonomial is a label for the leaf node(s) in Q, i.e. L = {
l | l ∈ M, ∃ s, s is a proper submonomial of l, s is the label of a leaf node
of Q }.

(4)  Let U, W be sets containing all monomials used as labels in Q with u ∈
U and w ∈ W. Define an edge, e_i, in Q as the triple (u,a,w), where u is
the label of a predecessor node in Q, a ∈ {0,1}, and w is the label of
some successor node in Q. Let E be the set of all unique edges in Q, i.e.
E = { e_i | e_i is an edge in Q }.

Now construct a new tree, Q', whose feature-set is also a subset of M, using the following
procedure. First, create a root node in Q' containing the nodes in Q having labels that
are elements of R. Next, for each a, a ∈ {0,1}, let the successor node of C ⊆ U, a node
in Q', be C' determined by the following:

\[ C' = \bigcup_{c \in C} \{\, c' \mid (c, a, c') \in E \,\} \]

The construction of successive nodes continues in this manner. For any given
node C in Q', where C is a subset of U, the successor node for the corresponding edge
in Q', C', will be the set of elements of W that are possible successors of any element of
C which is a predecessor for some edge in E. Every node will have an edge
corresponding to each possible value of the function a ∈ {0,1}. Recall from Chapter 1
that, for any given value of the function, the successor node for any triple not in E is
defined to be the null node.

Now we recursively build the tree, and, for every node, except a 0/1 node or the
'null' node, we construct a successor node for each value of the function. Note that C
is a terminal node in Q' if there exists a monomial c' ∈ C such that c' ∈ L. Also, using
our procedure, terminal node(s) may have successor node(s).

When the procedure ends we will have a deterministic tree, Q', which is also
equivalent to Q. Deterministic trees have a single root node and this is preferable. This
completes our description of how to perform DUALTREE feature construction.
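
The construction just described can be sketched in C++ as follows. This is a simplified illustration under stated assumptions, not the DUALTREE implementation: node labels are plain strings, the empty set stands in for the null node, and the names NodeSet, DetTree, successor, and featureConstruction are hypothetical. Unlike the procedure in the text, the sketch does not treat 0/1 nodes specially; a node without outgoing edges simply receives the null node as both successors.

```cpp
#include <array>
#include <cassert>
#include <map>
#include <queue>
#include <set>
#include <string>
#include <tuple>
#include <vector>

using Label = std::string;
using NodeSet = std::set<Label>;                  // a node of Q' is a set of nodes of Q
using Edge = std::tuple<Label, int, Label>;       // (u, a, w) with a in {0,1}
using DetTree = std::map<NodeSet, std::array<NodeSet, 2>>;

// Successor of C on value a: the union of { c' | (c, a, c') in E } over c in C.
NodeSet successor(const NodeSet& C, int a, const std::vector<Edge>& E) {
    assert(a == 0 || a == 1);
    NodeSet next;
    for (const Edge& e : E)
        if (std::get<1>(e) == a && C.count(std::get<0>(e)) != 0)
            next.insert(std::get<2>(e));
    return next;                                  // empty set plays the role of the null node
}

// Build the deterministic structure reachable from the set of roots R.
DetTree featureConstruction(const NodeSet& R, const std::vector<Edge>& E) {
    DetTree q;
    std::queue<NodeSet> work;
    work.push(R);
    while (!work.empty()) {
        NodeSet C = work.front();
        work.pop();
        if (C.empty() || q.count(C) != 0) continue;   // skip the null node and visited sets
        for (int a = 0; a <= 1; ++a) {
            q[C][a] = successor(C, a, E);
            work.push(q[C][a]);
        }
    }
    return q;
}
```

Starting from a two-root structure, the sketch produces a single root node holding both roots, so the result is deterministic in the sense used above.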

4.8.2    Validation  and  Verification 

Beckman (1980) focuses on the problem of finding an automaton with the smallest
number of states that is equivalent to a given deterministic finite automaton, where
equivalent automatons are ones that accept precisely the same set of tapes. Beckman's
(1980) procedure is as follows:

PROCEDURE    [Minimization  of  Finite  Automata,  Beckman  (1980)] 

Given  the  automaton  A: 

1.  Construct  the  dual  A*  of  A. 

2.  Apply the subset construction to A* obtaining an equivalent deterministic
version B of A*. B recognizes the tapes accepted by A, written backwards.

3.  Construct  the  dual  B*  of  B. 

4.  Apply  the  subset  construction  to  B*,  obtaining  an  equivalent  deterministic 
version  C  of  B*.  C  recognizes  the  same  tapes  as  A,  and  C  will  have  the 
smallest  number  of  states  of  any  finite  automaton  equivalent  to  A.    (275) 
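
The four steps of the quoted procedure can be sketched in C++ for automata over the symbols 0 and 1. This is a minimal illustration under our own assumptions, not code from Beckman (1980) or from DUALTREE: states are integers, the subset construction keeps only reachable non-empty subsets (so no dead state is created), and the names Automaton, step, dual, determinize, minimize, countStates, and accepts are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <queue>
#include <set>
#include <string>
#include <tuple>
#include <vector>

struct Automaton {
    std::set<int> start, accept;
    std::vector<std::tuple<int, int, int>> edges;   // (from, symbol in {0,1}, to)
};

// All states reachable from the set S on symbol a.
std::set<int> step(const Automaton& m, const std::set<int>& S, int a) {
    std::set<int> next;
    for (const auto& e : m.edges)
        if (std::get<1>(e) == a && S.count(std::get<0>(e))) next.insert(std::get<2>(e));
    return next;
}

// Dual: reverse every edge and swap the start and accepting states.
Automaton dual(const Automaton& m) {
    Automaton d;
    d.start = m.accept;
    d.accept = m.start;
    for (const auto& e : m.edges)
        d.edges.emplace_back(std::get<2>(e), std::get<1>(e), std::get<0>(e));
    return d;
}

// Subset construction restricted to reachable, non-empty subsets.
Automaton determinize(const Automaton& m) {
    Automaton d;
    if (m.start.empty()) return d;
    std::map<std::set<int>, int> id;
    std::queue<std::set<int>> work;
    id[m.start] = 0;
    d.start.insert(0);
    work.push(m.start);
    while (!work.empty()) {
        std::set<int> S = work.front();
        work.pop();
        for (int q : S)
            if (m.accept.count(q)) { d.accept.insert(id[S]); break; }
        for (int a = 0; a <= 1; ++a) {
            std::set<int> T = step(m, S, a);
            if (T.empty()) continue;                 // omit the null (dead) state
            if (!id.count(T)) {
                int fresh = static_cast<int>(id.size());
                id[T] = fresh;
                work.push(T);
            }
            d.edges.emplace_back(id[S], a, id[T]);
        }
    }
    return d;
}

// Steps 1 to 4: dual, subset construction, dual, subset construction.
Automaton minimize(const Automaton& a) {
    return determinize(dual(determinize(dual(a))));
}

// Number of distinct states mentioned anywhere in the automaton.
std::size_t countStates(const Automaton& m) {
    std::set<int> all(m.start);
    all.insert(m.accept.begin(), m.accept.end());
    for (const auto& e : m.edges) { all.insert(std::get<0>(e)); all.insert(std::get<2>(e)); }
    return all.size();
}

// Does the automaton accept the tape (a string of '0' and '1' characters)?
bool accepts(const Automaton& m, const std::string& tape) {
    std::set<int> S = m.start;
    for (char c : tape) {
        assert(c == '0' || c == '1');
        S = step(m, S, c - '0');
    }
    for (int q : S)
        if (m.accept.count(q)) return true;
    return false;
}
```

As a usage example, a three-state deterministic automaton that accepts every tape containing at least one 1, with two interchangeable accepting states, is reduced by minimize to two states while accepting the same tapes.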

The author does not offer his proof for the procedure; however, a proof that the procedure
does what it claims to do can be compiled using Brzozowski (1962), Mirkin (1966),
Kameda and Weiner (1968), and Beckman (1970). Instead, the author illustrates the
procedure using several examples and shows that the claim holds. Thus, for our work,
let us regard the author's claim as true.

If we apply Beckman's (1980) procedure as it is given to binary decision trees,
we encounter a key problem: we construct nodes that are conjuncts of classes! This
is completely unacceptable since the classes must remain mutually exclusive. We
co-mingle the classes in step 2 of the procedure when we apply the subset construction.
Using subset construction, the start state or root of the equivalent deterministic version
of the dual is the set of start states or roots of the dual. Recall that the root(s) of the dual
tree are the same as the leaves in the initial decision tree, which consist of the mutually
exclusive classes.


Subset construction functions to replace any non-deterministic machine with an
equivalent deterministic one. The literature on finite state machines details this widely
applied procedure, and, it appears appropriate to use this method in our work since, in
general, a dual tree represents a non-deterministic finite automaton. For example, a
decision tree that is not a leaf contains at least two mutually exclusive classes. This
means that its dual tree will have at least two start states or roots. Thus, in general, we
have a non-deterministic structure, which is undesirable.

We  resolve  the  problem  of  co-mingling  the  classes  by  modifying  the  procedure 
as follows. First, after forming the dual of some given decision tree, we 'partition' the
dual  tree  into  subtrees  of  the  dual  such  that  the  root(s)  of  each  dual-subtree  all  belong  to 
a  single  class.  Hence,  we  will  have  one  dual-subtree  for  every  class.  Note  that 
partitioning  the  dual  tree  in  this  manner  means  that  if  we  create  a  composite  tree  by 
connecting  the  individual  dual-subtrees  on  some  common  node  between  them,  then  this 
composite  dual-tree  is  equivalent  to  the  initial  dual  tree.  The  root  node  of  the  initial  tree 
is  a  common  node  for  all  of  the  dual-subtrees.  Partitioning  the  dual  tree  into  dual- 
subtrees  resolves  our  problem  because  now  the  root(s)  of  each  dual-subtree  all  represent 
the  same  class  assignment,  hence,  we  do  not  co-mingle  classes  when  using  DUALTREE 
feature  construction.  Each  dual-subtree  is  non-deterministic  if  it  has  more  than  one  root. 
Using  feature  construction  with  a  deterministic  dual-subtree  produces  an  identical  dual- 
subtree  for  the  given  subtree. 


Given our modification, we essentially apply Beckman's (1980) method as it is
shown. The following section describes our modified procedure and examines various
characteristics of the features we form using it.

4.8.2.1    Procedural  framework 

We  specify  our  DUALTREE  feature-construction  procedure  with  the  following 
steps: 

Given a binary decision tree Q,

1.  Construct the dual Q* of Q.

2.  Partition the dual Q* into dual-subtrees, Q_l*, where l ∈ L = {0,1}.

3.  FOR EACH Q_l*, perform the following steps:

    3a.  Apply DUALTREE feature construction to Q_l* obtaining an
         equivalent tree Q_l' for Q_l*.

    3b.  Construct the dual (Q_l')* of Q_l'.

    3c.  Apply DUALTREE feature construction to (Q_l')* obtaining an
         equivalent tree Q_l" for (Q_l')*.

4.  Construct a composite tree Q" by joining all of the Q_l"s on some common
    node between them like, for example, the null node.

    CLAIM: Q" is equivalent to Q; further, each Q_l" contains the smallest
    number of features of any tree equivalent to (Q_l')*.

5.  Output a list of the features in Q" that are helpful for a given learning
    problem. A feature comprised of only one attribute (i.e., a primitive
    feature) is not considered helpful.

The  list  of  features  is  further  processed  and  produced  as  a  finite  set  of  'useful'  features. 
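
Step 2 of the procedure, partitioning the dual by class, can be sketched in C++ as follows. The sketch assumes a representation chosen here for illustration only: each non-root node of Q records its parent and the value on the connecting edge, and each leaf records its class. The names Parent, Edge, and partitionDual are hypothetical.

```cpp
#include <cassert>
#include <map>
#include <set>
#include <string>
#include <tuple>

using Edge = std::tuple<std::string, int, std::string>;   // (from, value, to) in the dual

// Hypothetical record: the parent of a node in Q and the value on the edge leading to it.
struct Parent {
    std::string node;
    int value;
};

// For every class l, collect the reversed edges on the paths that lead from
// the leaves of class l back to the root of Q. Each set is one dual-subtree.
std::map<int, std::set<Edge>> partitionDual(const std::map<std::string, Parent>& parentOf,
                                            const std::map<std::string, int>& classOf) {
    std::map<int, std::set<Edge>> subtree;
    for (const auto& leaf : classOf) {
        assert(leaf.second == 0 || leaf.second == 1);
        std::string node = leaf.first;
        auto it = parentOf.find(node);
        while (it != parentOf.end()) {            // stop at the root, which has no parent
            subtree[leaf.second].insert(Edge(node, it->second.value, it->second.node));
            node = it->second.node;
            it = parentOf.find(node);
        }
    }
    return subtree;
}
```

In this representation the dual-subtrees may share edges near the root of Q, which is consistent with the root of the initial tree being a common node on which the subtrees can later be joined.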


Considering the composite tree created in step 4, we can say that this composite
tree contains a virtual tree for each class, where the leaves in the virtual tree all belong
to the same class. We offer a proof of our claim in the next section.

4.8.2.2   Claims  and  proofs 

We make a general claim that the DUALTREE procedure for forming new
features produces a minimal number of features: minimal in the sense that, given
certain conditions, its features represent the minimum number of features appearing in a
minimally-equivalent tree for some binary decision tree containing only primitive
attributes. Recall that a necessary condition of ours is not to create nodes in which the
classes are co-mingled. Before verifying our general claim, we first verify the claims
made in our procedural framework.

We begin by first making the following observations. For any dual-subtree, Q_l*,
(Q_l*)* = Q_l. A composite tree created by joining all of the Q_l's created this way must be
equivalent to Q since each Q_l* was formed from a dual tree, Q*, having the property that
(Q*)* = Q, and, the operations of 'partitioning' and 'joining', which may be regarded as
'inverse' operations, in no way destroy this fundamental property. Also, considering Q_l
and Q_l*, a tree and its dual, we can use Beckman's (1980) method (i.e., steps 3a, 3b, and
3c of our procedure) to obtain an equivalent tree for Q_l, and, furthermore, this equivalent
tree is guaranteed to have a minimum number of states or nodes.

The first part of our claim proposes that Q", the composite tree, is equivalent to
Q. Since we formed the composite tree by joining the subtrees, Q_l", on some common
node between them like, for example, the 'null' node, it follows that Q" is equivalent to
Q since we can form a composite tree equivalent to Q by joining all of the subtrees, Q_l,
on some common node between them; and, Q_l" is equivalent to Q_l for every l ∈ L.

The second part of our claim states that each tree, Q_l", contains the smallest
number of nodes of any tree equivalent to (Q_l*)*. A key assumption for our claim is that
the classes are mutually disjoint in the tree. Essentially, we can verify this part of our
claim using the observations previously described. Recall that Q_l is a tree equivalent to
(Q_l*)*, and, (Q_l*)* and Q_l* represent a tree and its dual. Since steps 3a, 3b, and 3c are
primarily the same as Beckman's (1980) method, assuming Beckman's claim is true, we
are guaranteed that Q_l" will have the smallest number of states or nodes of any tree
equivalent to Q_l. This result verifies our claim since Q_l = (Q_l*)*.

We begin a proof of our general claim by observing that DUALTREE produces
a list of the features or monotone monomials used in Q". Note that the number of
features on the list is equal to the sum of the number of features in each Q_l" (less
duplications), since Q" is formed by 'joining' these subtrees. Our general claim holds iff
Q" has the smallest number of nodes of any tree equivalent to Q, assuming mutually
disjoint classes appear in the trees. We verify this result using the contrapositive.

Suppose there is an equivalent tree for Q having a smaller number of nodes than
Q". If the classes of this tree are mutually exclusive, then for a given l ∈ {0,1}, the tree
contains a subtree that is equivalent to Q_l which also contains fewer nodes
than Q_l". We have already shown that Q_l" contains the smallest number of nodes of any
tree equivalent to Q_l. Also, since the inverse operations of 'partitioning' and 'joining'
do not introduce or remove any part of the 'composite' or 'component' tree-structures
(i.e., they do not change the number of nodes), it follows that our supposition is false.
Hence, Q" has the smallest number of states of any tree equivalent to Q.

The results of this section apply to binary decision trees. We would also like to
use our procedure with nominal and continuous data. The next section concludes
our discussion of feature construction using DUALTREE by describing several extensions
of our framework.

4.8.3   Extensions  for  DUALTREE 

Up  to  now  we  have  not  considered  using  nominal  and  continuously  valued 
attributes  primarily  because  our  time  complexity  model  is  based  on  using  binary  data. 
Since  the  BEBR  business  surveys  contain  binary,  nominal,  as  well  as  continuously  valued 
attributes,  our  empirical  model  must  be  able  to  work  with  nominal  and  continuous  data. 
At  this  juncture,  a  rather  straightforward  solution  is  to  simply  'binarize'  the  nominal  and 
continuous  data-sets.  This  amounts  to  'translating'  or  'mapping'  the  data-set  into  a  binary 
data-set  by  re-coding  the  nominal  and  continuous  information  into  a  binary  format  that 
is  equivalent  to  the  initial  data.  The  following  section  describes  our  approach. 
4.8.3.1    Binarizing  nominal  and  continuous  data 

Nominal attributes are attributes that have a fixed number of discrete values. For
example, binary attributes are nominal attributes that have only two values: 0 and 1, yes
and no, positive and negative, etc. Our rather straightforward approach for mapping a
nominal data-set into a binary one simply amounts to creating a binary attribute (i.e., a
binarized attribute) for each attribute-value pair. The 'binarized' attribute has the value
of '1' if its respective 'value' is assigned to the nominal attribute; otherwise the
'binarized' attribute has a value of '0'. Using this approach, the number of 'binarized'
attributes introduced depends on the number of nominal attributes as well as the number
of discrete values for each nominal attribute.
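
A minimal C++ sketch of this mapping follows. It is an illustration only: the function name binarizeNominal and the "attribute=value" naming scheme for the binarized attributes are assumptions made here, not the format used by DUALTREE or C4.5.

```cpp
#include <cassert>
#include <map>
#include <string>
#include <vector>

// Create one binarized attribute per attribute-value pair. The binarized
// attribute is 1 when the observed value equals its value, and 0 otherwise.
std::map<std::string, int> binarizeNominal(const std::string& attribute,
                                           const std::vector<std::string>& values,
                                           const std::string& observed) {
    assert(!values.empty());
    std::map<std::string, int> row;
    for (const std::string& v : values)
        row[attribute + "=" + v] = (v == observed) ? 1 : 0;   // hypothetical naming scheme
    return row;
}
```

A nominal attribute with three discrete values therefore contributes three binarized attributes, exactly one of which is 1 for any instance whose observed value is among the listed values.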

Continuous attributes have numeric values that are either integers or floating point
numbers. We cannot use the approach for 'binarizing' nominal data since continuous
attributes may have an infinite number of values. C4.5 contains rules for finding
appropriate thresholds against which to compare the values of continuous attributes. If
attribute A has continuous numeric values, C4.5 uses binary tests in the tree it builds that
are based on comparing the value of A against a threshold value Z, namely, A ≤ Z and
A > Z. Also, C4.5 employs mechanisms which ensure that all threshold values appearing
in the trees it produces actually occur in the data. Our approach for 'binarizing'
continuous data amounts to using the binary tests produced by C4.5. We create a
'binarized' attribute for every threshold-based test appearing in the input decision tree for
DUALTREE. In our approach, all of the 'binarized' attributes formed this way use the
'≤' operator in the test. For example, this means that (A ≤ Z) is a feature allowed by our
model, while (A > Z) is not. Note that a conjunct of threshold-based features
such as these may still be evaluated in the normal way. The number of threshold-based
features introduced depends on the number of threshold-based tests appearing in the input
tree.
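
The threshold-based binarization can be sketched in C++ as follows. This is an illustration under assumptions made here: the comparison is taken to be non-strict (attribute ≤ threshold), the thresholds are supplied by the caller rather than read from a C4.5 tree, and the names ThresholdTest and binarizeContinuous are hypothetical.

```cpp
#include <cassert>
#include <map>
#include <string>
#include <vector>

// One threshold-based test taken from the input decision tree (hypothetical record).
struct ThresholdTest {
    std::string attribute;
    double threshold;
};

// Evaluate every binarized attribute (attribute <= threshold) for one instance.
std::vector<int> binarizeContinuous(const std::vector<ThresholdTest>& tests,
                                    const std::map<std::string, double>& instance) {
    std::vector<int> bits;
    for (const ThresholdTest& t : tests) {
        auto it = instance.find(t.attribute);
        assert(it != instance.end() && "instance must supply every tested attribute");
        bits.push_back(it->second <= t.threshold ? 1 : 0);
    }
    return bits;
}
```

The number of binarized attributes equals the number of tests passed in, which matches the statement that the count depends on the threshold-based tests appearing in the input tree.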


4.8.3.2   Forming  features  with  binarized  data 

DUALTREE employs certain selection criteria for removing some of the features
from the list processed in step 5 of the procedure. This action resolves various problems
associated with using a binarized form of the data. Lopez De Mantaras (1991) reports
that binarized trees have the problem of being more difficult to interpret. We focus on
a single problem related to using a binarized form of nominal and continuous data. Our
selection criteria simply remove any feature from the list that is considered 'unuseful'
because of certain manifestations. We describe each type of 'unuseful' feature in the
following discussion.

Recall that DUALTREE forms features indiscriminately. Hence, considering
'binarized' nominal attributes or features, DUALTREE may form a feature that is a
conjunct involving two or more discrete values for the same attribute! Since we never
encounter an instance where an attribute has more than one value simultaneously,
DUALTREE disregards features such as these.

The threshold-based features formed by DUALTREE may contain several
thresholds for the same attribute. For example, the feature [(A ≤ 3) && (A ≤ 9) && (A
≤ 18)] is regarded by DUALTREE as a feature of size three; however, this feature is
'equivalent' to [(A ≤ 3)], a feature of size one. For now, DUALTREE does not employ
any type of mechanism for finding 'equivalent' features such as in the previous example.
Instead DUALTREE simply disregards any threshold-based feature having more than one
threshold-based test for the same attribute. This action ensures that the size of any feature
will be no greater than the number of primitive attributes appearing in the input decision
tree.
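
Both selection criteria reduce to one check: a feature is disregarded when the same primitive attribute contributes more than one of its binarized tests. The C++ sketch below illustrates this under assumptions made here; the record BinarizedTest and the functions isUseful and selectFeatures are hypothetical names, and the test text is kept as an opaque string.

```cpp
#include <algorithm>
#include <cassert>
#include <iterator>
#include <set>
#include <string>
#include <vector>

// A binarized test together with the primitive attribute it came from,
// e.g. {"A", "<=3"} or {"color", "=red"} (hypothetical encoding).
struct BinarizedTest {
    std::string attribute;
    std::string test;
};
using Feature = std::vector<BinarizedTest>;

// A feature is kept only if no primitive attribute occurs in it more than once.
bool isUseful(const Feature& f) {
    std::set<std::string> seen;
    for (const BinarizedTest& t : f) {
        assert(!t.attribute.empty());
        if (!seen.insert(t.attribute).second) return false;
    }
    return true;
}

// Drop every feature that fails the check above.
std::vector<Feature> selectFeatures(const std::vector<Feature>& candidates) {
    std::vector<Feature> kept;
    std::copy_if(candidates.begin(), candidates.end(), std::back_inserter(kept), isUseful);
    return kept;
}
```

Because each primitive attribute can appear at most once in a kept feature, the size of a kept feature cannot exceed the number of primitive attributes.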


CHAPTER  5 
DUALTREE's  DESIGN  AND  IMPLEMENTATION 


This chapter describes a program based on our DUALTREE feature-construction
framework and the hypothesis proposed in the previous chapter. Recall that we require
a practical procedure for constructing useful feature-sets of minimal cardinality. Our
DUALTREE algorithm produces a reduced number of states for a finite automaton.
Hopcroft (1971) describes an algorithm for minimizing the states of finite automata in
which the asymptotic running time in a worst case analysis grows as n log n where n is
the number of states. The proposed algorithm reduces overall computation time by using
linked-list data structures and list processing routines. Essentially, the algorithm works
as follows: (1) Create a 'state' table representing the automaton; (2) Invert the state table;
(3) Partition the states in the inverted table according to their outputs; (4) Select a 'block'
and an input symbol on which to refine the partition; (5) When no further refinement is
possible, all states in the same block of the partition can be shown to be equivalent. The
total number of steps in the algorithm is bounded by n log n, and the author validates this
bound for the algorithm. Note that 'inverting' the state table is analogous to forming the
'dual' of a binary decision tree.

We encoded DUALTREE using ANSI Standard C++. We tested the program
using the Borland C/C++ compiler. C++ is particularly well-suited for our work because
its basic support for data abstraction and modular programming allows us to clearly
express relationships among algorithms and data structures.

We represent the trees and/or automatons, or finite-state machines, using graph
data structures. Linked data structures are actually a representation of a graph. Many of
the algorithms we use for processing graphs (along with analyses of the algorithms) are
found in Sedgewick (1992). The fundamental dependence of automata topology on two
parameters, V(ertices) and E(dges), makes a comparative study of algorithms somewhat
more complicated because more possibilities arise. For example, for sparse trees or
graphs with relatively few edges (say less than V log V), one algorithm may take about
V^2 steps, while another algorithm for the same problem may take (E + V) log E steps.
The second algorithm would be better for sparse trees; however, the first may be preferred
for 'dense' trees or graphs (Sedgewick 1992).

Following  is  a  listing  of  the  C++  source-code  that  implements  DUALTREE: 

    // Member Function //
    void Dualtree::Initialize(char *FileName)
    //
    // Performs steps for DUALTREE Feature Construction
    //
    // Parameters
    //     FileName is the file name for the C4.5
    //     'names' and 'data' files
    {
        InterfaceTree.Initialize(FileName);
        // read the edges of the given tree

        InterfaceTree.ConstructDual(DUALADJ);
        // get the adjacency matrix

        // Building the Class Successors...
        for (cls = adjcstart; cls <= adjcstop; cls++)
            BuildClassSucc(cls);

        // Finding the Terminal Features...
        for (cls = adjcstart; cls <= adjcstop; cls++)
            FindTerminals(cls);

        // Building the Dual of the Dual...
        for (cls = adjcstart; cls <= adjcstop; cls++)
            BuildDualofDual(cls);

        // Finding the New Features...
        FindNewFeatures(CLASSList[(cls-adjcstart+1)]);

        UpdateNames/DataFiles();
    }; // end member function Initialize

We see that the function primarily has seven key parts. First, we read an input
file and next get an adjacency matrix representing a tree. These two steps function to
form the dual of a binary decision tree. The next two parts, 'BuildClassSucc(cls)' and
'FindTerminals(cls)', perform the task of DUALTREE feature construction on the
structure representing the dual of a binary decision tree. The argument for the functions
is the class or category which is a root of a depth-first search tree.
'BuildDualofDual(cls)' forms the 'dual of the dual', as described in the previous chapter.
'FindNewFeatures' is a function that performs DUALTREE feature construction on the
structure representing the dual of the dual. Finally, we update the '.names' and '.data'
files for C4.5 so that we can obtain a decision tree using the new features. We add each
feature to the list of primitive attributes in the initial '.names' file, and its corresponding
value to the '.data' file.

In order to process graphs with a computer program, we first must decide how to
represent, build, and search them within the computer. The next sections describe how
DUALTREE represents and searches the graphs it processes. This is followed by a
discussion of the 'Initialize()' class-member function described previously. We use results
from analysis of algorithms found in Sedgewick (1992) where appropriate.

5.1    DUALTREE's  Representation  Model 

Our first step in representing a decision tree is to map the feature or attribute
names to unique integers. Essentially, we formulate the problem in terms of objects and
connections between them. DUALTREE uses a String object to store character values
representing both features/attributes and the connections. A 'class-function' provided for
String objects translates the character value to a unique integer. Using integers
makes it possible to quickly access information corresponding to each vertex, using array
indexing. Our implementation uses a hash function based on Horner's method to compute
hash values for alphanumeric strings.
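
A hash function based on Horner's method can be sketched in C++ as follows. The sketch is an illustration rather than the DUALTREE code: the function name hornerHash, the radix of 32, and the table size parameter are assumptions made here. A hash of this kind does not by itself guarantee distinct values for distinct names, so a real implementation also needs some way to detect or resolve collisions.

```cpp
#include <cassert>
#include <string>

// Horner's method: treat the string as a number written in base `radix` and
// reduce modulo `tableSize` after every character so the running value stays small.
unsigned long hornerHash(const std::string& key, unsigned long tableSize) {
    assert(tableSize > 0);
    const unsigned long radix = 32;      // assumed radix, chosen for illustration
    unsigned long h = 0;
    for (unsigned char c : key)
        h = (radix * h + c) % tableSize;
    return h;
}
```

For example, with a table size of 101 the one-character name "a" (character code 97 in ASCII) hashes to 97, and every result is smaller than the table size, so it can be used directly as an array index.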

DUALTREE uses an adjacency-structure representation for a graph. For this
representation, all of the vertices connected to each vertex are listed on an adjacency list
for that vertex. We build the linked lists using an artificial node z at the end, which points
to itself. The artificial nodes for the beginning of the lists are kept in an array adj[]
indexed by vertex. The ith vertex read is assigned the integer i. To add a 'directed' edge
where y is a successor for x with a given function-value, we add y to x's adjacency list
and store the function-value in a field for the edge-value in adjacency list records.
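
The adjacency-structure representation can be sketched in C++ as follows. This is a simplified illustration, not the DUALTREE class: the names AdjNode and AdjacencyStructure are hypothetical, and for brevity the array holds plain head pointers instead of artificial head nodes. The artificial end node z that points to itself is kept as described.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical adjacency-list record: successor vertex plus the edge's function-value.
struct AdjNode {
    int vertex;
    int value;
    AdjNode* next;
};

class AdjacencyStructure {
public:
    explicit AdjacencyStructure(std::size_t vertices)
        : z_(new AdjNode{-1, -1, nullptr}) {
        z_->next = z_;                 // the artificial end node points to itself
        adj_.assign(vertices, z_);     // every list starts out empty
    }
    ~AdjacencyStructure() {
        for (AdjNode* head : adj_)
            while (head != z_) { AdjNode* dead = head; head = head->next; delete dead; }
        delete z_;
    }
    AdjacencyStructure(const AdjacencyStructure&) = delete;
    AdjacencyStructure& operator=(const AdjacencyStructure&) = delete;

    // Add the directed edge x --value--> y at the front of x's list.
    void addEdge(std::size_t x, int value, int y) {
        assert(x < adj_.size());
        adj_[x] = new AdjNode{y, value, adj_[x]};
    }

    // Successors of x reached through edges labelled `value`.
    std::vector<int> successors(std::size_t x, int value) const {
        std::vector<int> out;
        for (const AdjNode* t = adj_.at(x); t != z_; t = t->next)
            if (t->value == value) out.push_back(t->vertex);
        return out;
    }

private:
    AdjNode* z_;
    std::vector<AdjNode*> adj_;
};
```

Because new records are added at the front of a list, the order of the records depends on the order in which the edges are added.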

The storage-space requirement for the adjacency list representation is O(V + E),
where V is the number of vertices and E is the number of edges (Sedgewick 1992). Also,
the same graph can be represented many different ways using the adjacency-structure
representation. This is because the order in which the edges appear in the input
determines the order in which the vertices appear on the adjacency lists.

DUALTREE requires a binary decision tree to begin feature construction. We
obtain our initial tree using C4.5 (Quinlan 1993). This input file for DUALTREE consists
of the predecessor-function-value-successor triples, one per line, along with additional
information used by DUALTREE. The next section describes how DUALTREE reads
and stores this information using the adjacency-structure representation.

5.1.1    DUALTREE's  Adjacency-Structure 

DUALTREE  uses  the  function,  ReadName(file,  char-array),  to  read  character 
values  from  a  specified  file  and  store  them  in  the  character  array  passed  in  the  argument 
list.  These  character  values  are  used  to  create  string  objects  whose  hash  values  are  used 
by  DUALTREE  when  processing  them.  We  assign  an  id  number  to  each  unique  vertex. 
Appendix  B  shows  the  C++  source-code  listing  for  several  major  code-blocks  of  the 
program.  The  code-block  shown  in  Appendix  B  for  Newface::Initialize(char  *fn)  is  the 
C++  code  we  developed  for  reading  the  'edges'  file  given  by  C4.5. 

A String object is created for each element of the triple. Since the input file
contains edges of a binary decision tree, each predecessor-function-value pair determines
one and only one successor; hence, to see if an edge has already been added (i.e., trees
with the replication problem contain duplicate edges), we only need to check the first two
elements of the triples. DUALTREE uses a radix-searching method called digital tree
searching to search for the existence of attribute names in the digital tree used for storing
this information. The next section describes this method.

5.1.2   Searching  and  Sorting  Feature  Names 

DUALTREE uses searching and sorting methods based on the search key's 'bits'
in the machine or hardware. C++ offers low level operators that make the bits of search
keys easily accessible. These methods are called Radix Searching/Sorting Methods. Key
advantages of these methods are that they can provide very fast access to data and that
they provide reasonable worst-case performance. A disadvantage is that character data,
which is biased, can lead to degenerate trees with bad performance (Sedgewick 1992).

In  digital  tree  searching  we  branch  in  a  tree  according  to  the  key's  bits.  At  the 
first  level  we  use  the  leading  bit.  We  use  the  second  leading  bit  at  the  second  level,  and 
so  on  until  an  external  node,  or  z  node,  is  reached.  No  path  in  the  tree  will  be  any  longer 
than  the  number  of  bits  in  the  keys,  and  the  length  of  the  longest  path  in  a  digital  search 
tree  is  the  length  of  the  longest  match  in  the  leading  bits  between  any  two  keys  in  the  tree 
(Sedgewick  1992).    Additionally,  we  have  the  following  result: 

PROPERTY 17.1 [Digital-Tree-Search (Sedgewick 1992)]

A search or insertion in a digital search tree requires about lg N
comparisons on the average and b comparisons in the worst case in a tree built
from N random b-bit keys. (247)

DUALTREE uses a 32-bit unsigned long integer for each key; hence the algorithm makes 32
comparisons in the worst case for searches or insertions in the trees used for storing both
the primitive attribute names and the names of newly formed features. The range of
numbers for an unsigned long int is 0 to 4,294,967,295. No string object's hash value is
larger than this number.
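
To make the search and insertion procedure concrete, the following is a minimal sketch of a digital search tree over 32-bit keys, in the spirit of Sedgewick (1992). It is an illustration under stated assumptions, not DUALTREE's own code: the class name DigitalSearchTree and its members are hypothetical, and std::unique_ptr is used for node ownership.

```cpp
#include <cstdint>
#include <memory>

// Illustrative digital search tree for 32-bit keys (hypothetical names).
// Each node stores a key; at depth d we branch on the d-th leading bit.
class DigitalSearchTree {
public:
    // Returns true if key is stored in the tree.
    bool contains(std::uint32_t key) const {
        const Node* node = root_.get();
        int bit = kBits - 1;                    // start at the leading bit
        while (node != nullptr) {
            if (node->key == key) return true;
            if (bit < 0) return false;          // defensive; not reached for distinct keys
            node = node->child[bitOf(key, bit)].get();
            --bit;
        }
        return false;
    }

    // Inserts key; returns false if it was already present.
    bool insert(std::uint32_t key) {
        std::unique_ptr<Node>* link = &root_;
        int bit = kBits - 1;
        while (*link) {
            if ((*link)->key == key) return false;
            // A node at depth 32 can only hold the key whose 32 bits match
            // the path, so bit never drops below 0 here.
            link = &(*link)->child[bitOf(key, bit)];
            --bit;
        }
        link->reset(new Node(key));
        return true;
    }

private:
    struct Node {
        explicit Node(std::uint32_t k) : key(k) {}
        std::uint32_t key;
        std::unique_ptr<Node> child[2];         // child[0]: bit is 0, child[1]: bit is 1
    };

    static constexpr int kBits = 32;

    static unsigned bitOf(std::uint32_t key, int bit) {
        return (key >> bit) & 1u;
    }

    std::unique_ptr<Node> root_;
};
```

Because every step consumes one bit of the key, the length of any search path is limited by the key width rather than by the number of stored keys.
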

Radix-sorting methods operate on keys that are binary numbers from some
restricted range, for example, 32 bits. These types of algorithms treat the keys as
numbers represented in a base-M number system (e.g., M = 2) and work with individual
digits of the numbers. DUALTREE uses a method called radix exchange sort, which
examines the bits in the keys from left to right and manipulates the records in a manner
similar to Quicksort. Sedgewick (1992) shows that the running time of radix exchange
sort for sorting N records with b-bit keys is essentially Nb. We also have the following
property:

PROPERTY 10.1 [Performance of Radix Sort (Sedgewick 1992)]

Radix-exchange sort examines about N lg N bits, on average. (141)

Recall  that  features  are  conjuncts  of  the  primitive  attributes.  Each  primitive 
attribute  has  a  key  for  its  record  which  is  the  hash  value  of  the  string  object  representing 
it.  DUALTREE  'sorts'  the  attribute-keys  comprising  each  feature  using  radix  exchange 
sort  before  any  processing  is  done  using  the  feature.  Adopting  this  method  allows  us  to 
easily  check  two  features  for  equality  regardless  of  the  order  or  position  of  the  primitive 
attributes  comprising  them.  Duplicate  edges  are  not  stored,  hence,  the  replication  problem 
does  not  impact  the  basic  storage  requirement  for  the  edges  of  a  tree. 
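
As an illustration of why sorting the attribute-keys makes feature comparison order-independent, the sketch below implements radix exchange sort on 32-bit keys and uses it to put a feature into a canonical order. The names Feature, radixExchange, canonical and sameFeature are hypothetical and chosen for this example only; the actual listing is the one in Appendix B.

```cpp
#include <cstddef>
#include <cstdint>
#include <utility>
#include <vector>

// A feature is represented here as the list of its attribute keys
// (assumed to be 32-bit hash values).
using Feature = std::vector<std::uint32_t>;

// Radix exchange sort of keys[lo, hi), examining bits from `bit` down to 0.
// Keys whose current bit is 0 are moved before keys whose bit is 1,
// then each half is sorted on the next bit (similar to Quicksort).
void radixExchange(Feature& keys, std::size_t lo, std::size_t hi, int bit) {
    if (hi - lo < 2 || bit < 0) return;
    std::size_t i = lo, j = hi;
    while (i < j) {
        if (((keys[i] >> bit) & 1u) == 0) ++i;
        else std::swap(keys[i], keys[--j]);
    }
    radixExchange(keys, lo, i, bit - 1);
    radixExchange(keys, i, hi, bit - 1);
}

// Returns a copy of the feature with its keys in ascending order.
Feature canonical(Feature f) {
    radixExchange(f, 0, f.size(), 31);
    return f;
}

// Two features are equal if they contain the same keys, in any order.
bool sameFeature(const Feature& a, const Feature& b) {
    return canonical(a) == canonical(b);
}
```

Once both features are in canonical order, equality reduces to an element-by-element comparison.
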



5.2   Graph  Processing  in  DUALTREE 

DUALTREE  uses  a  depth-first  search  technique  to  'visit'  every  node  and  check 
every  edge  in  a  graph  systematically.  An  array,  val[V],  is  used  to  record  the  order  in 
which  the  vertices  are  visited.  Initially,  the  array  has  a  value  of  'unseen'  for  each  entry. 
The  goal  is  to  systematically  visit  all  the  vertices  of  a  graph,  setting  the  val  entry  for  the 
ith  vertex  visited  to  some  number  other  than  'unseen'.  Given  a  starting  point,  depth-first 
search  wends  its  way  through  the  graph,  storing  on  the  stack  the  points  where  other  paths 
branch off. Essentially, we associate a depth-first search tree with each connected
component. Traversing the tree in 'preorder', for example, gives the vertices of the graph
in the order they are first encountered by the search. This 'forest' of depth-first search
trees  is  DUALTREE's  way  of  representing  a  graph.  Additionally,  we  have  the  following 
property: 

PROPERTY 29.1 [Depth-First Search Performance (Sedgewick 1992)]

Depth-first search of a graph represented with adjacency lists requires time
proportional to V + E. (426)

A graph exhibits the 'replication' problem if and only if a node that is not unseen
is discovered in a visit. In other words, if we encounter an edge pointing to a vertex that
we've already visited, then the depth-first search tree associated with the 'seen' vertex is
'replicated' in the graph. Consider traversing a graph and counting the number of times we
visit a vertex that has already been seen; we take this count to be the number of cycles in
the graph. The degree or extent to which the replication problem is manifested in the graph
is then represented by the number of cycles in the graph.
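
The traversal and the replication count can be sketched as follows. This is an illustrative reconstruction, not the program's code: the names Graph, DfsResult, visit and depthFirstSearch are hypothetical, and vertices are assumed to be numbered from 0. The val array plays the role described above, and revisits counts the edges that lead to an already-seen vertex.

```cpp
#include <cstddef>
#include <vector>

const int kUnseen = 0;   // marker for vertices not yet visited

struct Graph {
    std::vector<std::vector<int>> adj;   // adj[v] lists the successors of v
};

struct DfsResult {
    std::vector<int> val;   // val[v] = position of v in the visiting order (1-based)
    int revisits = 0;       // edges that pointed to an already-seen vertex
};

// Visit v, then every unseen successor of v; count edges to seen vertices.
void visit(const Graph& g, int v, int& counter, DfsResult& r) {
    r.val[v] = ++counter;
    for (int w : g.adj[v]) {
        if (r.val[w] == kUnseen) visit(g, w, counter, r);
        else ++r.revisits;
    }
}

// Builds the forest of depth-first search trees, one per unseen start vertex.
DfsResult depthFirstSearch(const Graph& g) {
    DfsResult r;
    r.val.assign(g.adj.size(), kUnseen);
    int counter = 0;
    for (std::size_t v = 0; v < g.adj.size(); ++v)
        if (r.val[v] == kUnseen) visit(g, static_cast<int>(v), counter, r);
    return r;
}
```

For a graph in which two different vertices share one successor, the shared vertex is reached twice, so revisits is 1; a graph without replication yields 0.
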



5.3   Feature  Construction  with  DUALTREE 

The source-code listing for implementing DUALTREE, shown previously, primarily
consists of seven steps. The first two steps perform the tasks of (1) reading in the edges
of a tree, (2) building the adjacency-structure representation of the tree, and (3)
constructing the 'dual' of the input tree. Newface is a class in DUALTREE that
essentially contains procedures for processing the 'edges' file given by C4.5. This class
contains public functions to process queries regarding the input tree and its primitive
attributes. Additionally, this class fills an array that represents the 'dual' of the
adjacency matrix representing the initial tree. The source-code listing for filling the array
is also found in Appendix B; see the listing for Newface::ConstructDual(). In this
recursive implementation of 'DualVisit(array, id)', to DualVisit a vertex means checking
all its edges to see if they lead to vertices that haven't yet been seen; if so, we DualVisit
them. For each edge from x to y, this amounts to creating a node containing x on y's
adjacency list for the corresponding edge-value. We fill the array with this 'dual'
information as we examine each edge.
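
The effect of constructing the dual can be shown with a short sketch. It is a simplification under stated assumptions: the names AdjNode, AdjacencyLists and constructDual are hypothetical, vertices are integer ids, and a plain loop over the adjacency lists replaces the recursive DualVisit of Appendix B. Each edge from x to y is examined once and produces one node containing x on y's list, with the edge-value preserved.

```cpp
#include <cstddef>
#include <vector>

// One entry on an adjacency list: the neighbouring vertex and the
// function value (0 or 1) labelling the edge.
struct AdjNode {
    int vertex;
    int edgeValue;
};

using AdjacencyLists = std::vector<std::vector<AdjNode>>;

// For every edge x -> y with value v in the tree, the dual receives
// the edge y -> x with the same value v.
AdjacencyLists constructDual(const AdjacencyLists& tree) {
    AdjacencyLists dual(tree.size());
    for (std::size_t x = 0; x < tree.size(); ++x)
        for (const AdjNode& n : tree[x])
            dual[n.vertex].push_back(AdjNode{static_cast<int>(x), n.edgeValue});
    return dual;
}
```

In the dual, the leaves of the input tree have non-empty adjacency lists and the original root has an empty one, which matches the exchange of roots and terminal nodes.
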

Since each edge is examined once and only once, the running time of
'ConstructDual()' is linear in the number of edges. To summarize the complexity results
for accomplishing the steps we've discussed, let V and E be the number of vertices and
edges of a tree. Let N equal the number of keys used to represent the set of vertices.
Note that N is bounded by V, since the replication problem gives a 'duplication' of vertices
or edges. Storage space for the adjacency-structure representation is O(V + E). The
input file contains one edge per line. Since DUALTREE requires edges for input, the
running time for reading the input file is linear in E. Each search or insertion in the
digital tree containing records for each primitive attribute found in the input requires lg
N comparisons on the average and 32 comparisons in the worst case. The running time
for 'Constructing the Dual' is linear in E.

With the completion of the second step, DUALTREE has essentially performed
step one of the steps given in the DUALTREE Construction-Framework described in the
previous chapter. The following sections describe how the program accomplishes the
remaining steps of the procedure. We also discuss the running time for key program
blocks and conclude with a discussion of the overall running time for DUALTREE.

5.3.1    Building  Class  Successors 

The class member function 'BuildClassSucc(id)' basically finds all possible
successors for a given root node in the 'dual' tree. These nodes are leaves or classes in
the initial tree. Hence, this function is called once for each class-name used as a label
for a leaf node in the input decision tree. For the given class-name id, 'BuildClassSucc()'
first creates the successor for each value of the function (i.e., '0' and '1'). It does this
by traversing the dual-adjacency lists starting at the indexed vertex and checking the
edge-value field for each node on its list. If the edge-value is '1', then the primitive
attribute is added to the set of attributes for the 'One-Value-Successor-Feature'.
Otherwise it is part of the 'Zero-Value-Successor-Feature'. The source-code listing for
this task is also given in Appendix B for the function Dualtree::BuildClassSucc(). The
running time required for processing a list is bounded by N. A feature can contain at most
N of the primitive attributes in the input file. This number may not be equal to the
number of primitive attributes found in the 'names' file used by C4.5 to build the initial
decision tree.

Once  we  have  created  the  successors  for  a  class  and  the  function  values,  we  then 
sort  the  features,  get  hash-values  for  each  of  the  features,  and  search  the  digital  tree  to 
see  if  we  have  previously  seen  the  features.  If  the  features  are  new  features,  we  assign 
'id'  numbers  to  them  and  then  insert  records  for  the  features  in  a  digital  tree.  We  next 
add  the  features  to  an  adjacency-structure  representing  the  'dual'  of  the  decision  tree  after 
we  perform  DUALTREE-Feature-Construction.  The  source-code  listing  for  accomplishing 
this  task  is  found  in  Appendix  B  under  the  heading  C++  Code  for  Adding  New  Features. 
This code-block does searches and insertions using a digital search tree. It is executed
twice for each class, using the 'One-Value-Successor-Feature' and the
'Zero-Value-Successor-Feature'. Each search or insertion now requires lg j comparisons on
the average, where j is the number of new keys associated with the newly formed features;
32 comparisons are still required in the worst case. Features are sorted before we
determine their hash values. The amount of time required for sorting a feature
fundamentally depends on the 'feature-size'. Let s be the size of the feature we are
sorting. Using 32-bit keys, sorting a feature examines about s lg s bits on average and 32s
bits in the worst case.

Finally, 'BuildClassSucc()' makes two calls to the recursive function
'FindFeatures(id)', once for each of the Class Successors. 'FindFeatures()' recursively
builds the depth-first search tree for a given vertex id number. We analyze the running
time of this function in the next section.

5.3.1.1    Finding  features 

'FindFeatures()' is a function structured almost identically to the function
'BuildClassSucc()', except that it makes recursive calls to itself using 'Zero-Value-List'
and 'One-Value-List' as arguments. This function essentially builds a subtree rooted at
the given feature. 'FindFeatures()' primarily performs the following steps for a given
feature F.

Step 1.   For each v ∈ {0,1} DO
Step 2.       F' = FindSuccessor(F, v)
Step 3.       found = SearchTree(F')   (1 if in tree)
Step 4.       IF (F' ≠ null AND not(found)) THEN
Step 4a.          InsertInTree(F')
Step 4b.          FindFeatures(F')

The last step is the recursive call. Essentially, these steps form a simple loop. The time
necessary to traverse the loop for a given v and an initial feature is proportional to the
number of edges in the dual tree whose successor for the given predecessor-function-value
pair is not the null node. Note that we perform searches and insertions using a
digital search tree containing a record for each of the features formed by DUALTREE.
Let the total number of features formed by DUALTREE be equal to j. We will execute
the loop statements at least j times. Only the first four steps are executed for duplicate
features. Next, let us consider the time spent in Step 2 through Step 4b for a given value
v.

Step 2 partitions a feature on a given value of the function. This results in another
feature, which is the successor for F given v. Given v, the time necessary to partition a
feature is proportional to the number of edges in the dual tree that have one of the
primitive attributes comprising the feature as a predecessor in the triple representing the
edge. This is a finite number, since there are only a finite number of such edges in
the dual tree. A feature contains at most k primitive attributes, where k is the number of
attribute-names used in the initial decision tree. The time for partitioning a feature
depends on the feature size as well as the specific attributes comprising the feature. Let
us denote this finite running time by tp. Thus, each feature has its own tp. However,
using array indexing for our processes makes tp negligible.

We  perform  a  search  of  the  digital  search  tree  in  step  3.  Note  that  before  the 
search  begins,  we  check  to  see  if  F'  is  the  null  node.  If  it  is  then  no  search  takes  place. 
Next  we  sort  F'  and  then  find  its  hash  value  which  we  use  as  the  key  for  searching  the 
digital tree's records. The variable found is set to '1' if a record for the feature is already
in the tree, and '0' otherwise. Given j, each search requires lg j comparisons on average
and  32  comparisons  in  the  worst  case.  The  size  of  F'  determines  how  long  it  takes  to  sort 
it.  The  feature-size  is  bounded  by  k,  hence,  a  sort  will  examine  32k  bits  in  the  worst 
case. 

Step 4 shows the 'termination' conditions for the recursive program. If we have
formed the null feature or a previously formed feature, then the recursion ends.
Otherwise a new feature is inserted in the digital tree, then the recursive call is made
using the new feature as the argument. A basic characteristic of a recursive program is
that it calls itself with a smaller value of its argument. In our case, each new feature
corresponds to a partition, or subspace, of the feature space associated with the initial
feature. The function-call is made twice, once for each value of the function. Thus, no
more than 2j calls are made to the function. Insertions in the digital tree require 32
comparisons in the worst case and lg j comparisons on the average.
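
Steps 1 through 4b can be sketched in a few lines. The sketch rests on assumptions that should be read as such: FindSuccessor is taken to collect every vertex reached from some attribute of the feature along an edge with the given value, a std::set of attribute ids stands in for the sorted key list, and a std::set of features stands in for the hash value plus digital tree lookup. The names Feature, findSuccessor and findFeatures are hypothetical.

```cpp
#include <set>
#include <vector>

struct AdjNode {
    int vertex;
    int edgeValue;   // function value (0 or 1) on the edge
};

using AdjacencyLists = std::vector<std::vector<AdjNode>>;
using Feature = std::set<int>;   // attribute ids, kept sorted and duplicate-free

// Step 2 (assumed semantics): the successor of feature f on value v is the
// set of vertices reached from any attribute in f by an edge labelled v.
Feature findSuccessor(const AdjacencyLists& dual, const Feature& f, int v) {
    Feature next;
    for (int x : f)
        for (const AdjNode& n : dual[x])
            if (n.edgeValue == v) next.insert(n.vertex);
    return next;
}

// Recursively forms every feature reachable from f, recording each once.
void findFeatures(const AdjacencyLists& dual, const Feature& f,
                  std::set<Feature>& seen) {
    for (int v = 0; v <= 1; ++v) {                      // Step 1
        Feature next = findSuccessor(dual, f, v);       // Step 2
        bool found = seen.count(next) > 0;              // Step 3
        if (!next.empty() && !found) {                  // Step 4
            seen.insert(next);                          // Step 4a
            findFeatures(dual, next, seen);             // Step 4b
        }
    }
}
```

The recursion stops on the empty (null) feature or on a feature that has already been recorded, so each distinct feature triggers at most one pair of successor computations.
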

Summarizing, the time required to build the class successors and find the new
features is bounded by some constant times the sum of two terms: the product of the
number of classes and N, and j. The multiplier is largely determined by the number of bits
used for the keys. For example, if we use 16 bits for the keys, the value of the multiplier
would be lower.

We  have  now  completed  the  second  stage  of  our  basic  four-stage  DUALTREE- 
Feature-Construction  model.  We  refer  to  the  features  we  have  formed,  at  this  juncture, 
as  Stage2Features  in  the  sequel.  Note  that  a  Stage2Feature  can  in  fact  be  one  of  the 
primitive  attributes  found  in  the  input  file.  From  a  comparative  standpoint,  Hopcroft's 
algorithm  (1971)  terminates  at  this  point.  Our  model  proceeds  by  next  constructing  the 
'dual  of  the  dual'  and  then  performing  DUALTREE  feature  construction.  These  steps  are 
essentially  a  repetition  of  the  steps  we  have  previously  discussed.  Thus,  from  a  time 
analysis  point  of  view,  it  appears  that  we  simply  need  to  add  a  factor  of  2  to  the  time 
requirement  established  previously.  We  describe  the  functions  for  performing  the 
remaining  steps  in  the  following  sections. 

5.3.2    Building  the  Dual  of  the  Dual 

The next step in our framework calls for building the 'dual' of the
'dual-adjacency-structure' formed previously. The 'dual-adjacency-structure' represents an
'equivalent' graph after we have performed DUALTREE feature construction using the
initial dual adjacency structure created with the primitive attributes of an initial decision
tree.  Recall  that  in  forming  the  dual,  root  nodes  become  terminal  nodes  and  terminal 
nodes  become  root  nodes.  Root  nodes  in  the  dual  graph  consist  of  class-labeled  vertices. 
Hence,  in  the  graph  representing  the  'dual  of  the  dual',  the  terminal  leaves  have  class- 
names  for  labels,  as  found  in  the  input  decision  tree.  A  'terminal'  feature  is  a  feature 
containing  a  primitive  attribute  that  is  a  'terminal'  attribute  in  the  initial  adjacency- 
structure.  The  number  of  roots  or  'start'  nodes  in  the  graph  representing  the  'dual  of  the 
dual'  is  equal  to  the  number  of  terminal  features  in  the  'equivalent'  graph  previously 
formed. 

The  recursive  function,  'BuildDualofDual(id)',  creates  an  adjacency  structure 
consisting  of  a  forest  of  depth-first  search  trees  for  a  given  vertex  id.  'BuildDualofDual' 
is  coded  almost  identically  to  the  recursive  function  DualVisit.  A  difference  between  the 
two  functions  is  that  'BuildDualofDual'  fills  a  two  dimensional  array  that  essentially 
represents  a  separate  graph  for  each  of  the  classes.  A  listing  for  'BuildDualofDual'  is 
found  in  Appendix  B.  Writing  the  'dual  of  the  dual'  information  amounts  to  creating  a 
node containing y on x's adjacency list for each corresponding edge-value. Note that the
'dual' information consists of a node containing x on y's adjacency list. The 'initial'
information comprises a node containing y on x's adjacency list. Thus, DUALTREE
produces  'equivalent'  information  using  the  adjacency-structure  representation.  The 
degree  of  equivalency  is  based  on  our  previous  discussion  concerning  the  correctness  of 
the  procedure,  found  in  the  previous  chapter. 



The running time of 'BuildDualofDual' is linear in the number of edges in the
graph representing an equivalent graph for the dual of the initial graph for the input tree.
The sum of this number of edges and the number of Stage2Features, times some constant,
represents a bound for storage space of the graph.

Before building the 'dual of the dual', DUALTREE first finds the terminal features
in each of the depth-first search trees rooted at the class-names in the
equivalent-dual-graph. The next section describes how this is done.
5.3.2.1    Finding  terminal  features 

DUALTREE  creates  one  adjacency-structure  representing  an  equivalent  graph  for 
the  dual  graph,  using  DUALTREE  feature  construction.  For  this  adjacency-structure,  the 
algorithm processes a forest of depth-first search trees rooted at each of the vertices
labeled  with  a  class-name  found  in  the  input  file.  The  program  logic  essentially  works 
with  more  than  one  graph,  using  the  same  data  structure,  by  processing  each  of  the  depth- 
first  search  trees  rooted  at  the  class  nodes.  'FindTerminals(id)'  is  a  function  that  finds 
all  of  the  terminal  features  in  a  given  depth-first  search  tree  by  recursively  calling  itself 
to  traverse  the  adjacency  matrix  starting  at  the  root.  A  field  in  each  record  for  a  feature 
indicates  whether  the  feature  is  a  terminal  feature.  This  field  was  adjusted  by  the 
'FindFeatures'  routine  as  the  new  features  were  formed.  A  feature  is  a  terminal  feature 
if  one  of  its  primitive  attributes  is  a  leaf  in  the  adjacency  structure  representing  the  'dual' 
of  the  decision  tree.    This  attribute  is  also  a  root  of  the  input  decision  tree. 

'FindTerminals()' creates a list of terminal features for a class using the statements
shown in Appendix B for Dualtree::FindTerminals(). The running time for
'FindTerminals' is linear in the number of edges in the dual-equivalent-graph. This is the
identical set of edges processed by 'BuildDualofDual()'.

5.3.3    Finding  New  Features 

When  performing  DUALTREE  feature  construction  using  the  'dual  of  the  dual', 
we  first  create  a  feature  which  is  a  conjunct  of  each  primitive  attribute  contained  in  the 
set  of  terminal  features  for  a  class.  In  essence,  we  are  processing  a  'list  of  lists',  or  'set 
of  sets'.  We  add  the  appropriate  logic  to  DUALTREE  to  store  feature-records  such  that 
for  any  feature,  each  attribute  comprising  the  feature  appears  at  most  once  when  listing 
the  feature's  attributes.  'FindNewFeatures(classId)'  is  a  function  that  recursively  builds 
a  depth-first  search  tree  rooted  at  a  feature  formed  using  the  terminal  features  for  a  class. 
'FindNewFeatures()' operates almost identically to 'FindFeatures()' discussed previously.
The two functions differ in the type of data structure they process. Instead of processing
a list of integers, 'FindNewFeatures()' processes a list of lists of integers. Also, instead
of updating an adjacency array, the function adds every feature it forms to a list for these
features. If a feature is formed which has been seen before (i.e., a Stage2Feature), it will
be on the list of features. The list of features will contain the features formed using the
terminal features and will not contain any features having a size of 1 (e.g., a primitive
attribute). We refer to this list of features in the sequel as Stage4Features. Thus, the
number of features formed by DUALTREE is equal to the number of Stage2Features plus
the number of Stage4Features minus the number of features that are both Stage2Features
and Stage4Features.


The running time for 'FindNewFeatures()', for the most part, is identical to that
for 'FindFeatures()'. Assuming that tp is negligible, each search or insertion still requires
lg j comparisons on the average and 32 comparisons in the worst case, where j is the total
number of features formed by DUALTREE.

After we have formed the new features, we create new 'names' and 'data' files
for use with C4.5. The overall time required for forming new features is discussed in the
next section.

5.4 DUALTREE's Time Complexity

Each of the program blocks we have discussed is executed once, in sequential
order, when the program is called. The complexity results for the blocks include the
following: O(V + E); O(E); O(CV + j); O(32); O(E'); and O(E''). C is the number of
class-names used as node-labels in the input decision tree; E' is the number of edges in
the adjacency-structure representing the 'dual' graph before we perform DUALTREE
feature construction; and E'' represents the number of edges in the adjacency-structure for
the 'dual of the dual' graph, again before we perform feature construction. Since the
program  blocks  are  executed  in  a  sequential  manner,  the  overall  complexity  result  for 
DUALTREE  is  given  by  the  dominating  bound  among  the  list  of  bounds. 

In the previous chapter we showed that E'' < E' = E. Hence, the dominating
bound is O(V + E) or O(CV + j). If j << CE (the number of classes times the number
of edges for the input tree), the overall complexity for DUALTREE is given by O(V +
E); otherwise it is O(CV + j).


CHAPTER  6 
EXPERIMENTS 


The  initial  objective  of  our  experiments  was  to  explore  the  capabilities  and 
limitations  of  DUALTREE  using  a  decision  tree  given  by  a  standard  decision  tree 
algorithm.  We  evaluate  the  performance  of  DUALTREE  using  two  trees  given  by  C4.5: 
one before and one after DUALTREE's features are added to a set of attributes used to
build  the  trees.  The  performance  criteria  we  report  are  (1)  the  size  or  number  of  nodes, 
(2)  the  misclassification  rate,  and  (3)  the  rank  of  the  'Initial'  decision  tree  given  by  C4.5. 
We  test  to  see  if  using  the  features  formed  by  DUALTREE  improves  performance. 
Another  objective  of  our  experiments  was  to  see  if  DUALTREE  forms  features  from  the 
BEBR  business  surveys  that  plausibly  describe  consumer  life-styles  and  expectations. 

We  first  tested  the  algorithm  by  examining  small  DNF  descriptions  of  Boolean 
functions  in  the  presence  of  irrelevant  attributes,  using  three  synthetic  Boolean  domains. 
Next,  we  studied  DUALTREE's  performance  using  nominal  and  continuous  data-sets. 
Finally,  we  used  the  algorithm  to  explore  our  hypotheses  associated  with  the  BEBR 
household  surveys.  We  use  our  empirical  results  to  see  if  the  DUALTREE  procedure  for 
forming  useful  features  produces  feature-sets  of  lower  cardinality.  Next,  we  examine 
descriptions  of  consumer  life-styles  and  expectations,  derived  from  features  formed  using 
the  BEBR  data.  The  following  sections  examine  our  experimental  design  and  discuss 
DUALTREE  feature  construction  using  binary,  nominal,  and  continuous  data. 


6.1    Experimental  Design 

The  experiments  described  in  this  work  were  designed  primarily  to  test  a  single 
component of DUALTREE feature construction: the usefulness of the features in terms
of  performance  and  other  criteria.  The  independent  variable  for  the  experiments  is 
essentially  the  set  of  attribute-names  for  a  classification  problem  solved  by  C4.5.  Because 
the  specific  forms  of  the  independent  variables,  or  attributes,  differ  in  each  classification 
problem,  they  are  fully  detailed  in  the  sections  that  follow. 

The  dependent  variables  of  size,  misclassification  rate,  and  rank  were  recorded  in 
each  experiment  for  (1)  the  'Initial'  tree  given  by  C4.5,  and,  (2)  DUALTREE  when 
forming  a  finite  number  of  features.  Also,  C4.5  creates  a  'Simplified'  tree  and  shows 
how the simplified tree performs on the training set from which it was constructed. An
additional  dependent  parameter  depicting  an  error  rate  for  classifying  unseen  cases,  or 
predictive  accuracy,  along  with  the  other  performance  parameters,  was  also  recorded  for 
the  'Simplified'  tree.  This  variable  is  referred  to  as  the  'estimate'  (Est)  for  the  tree.  C4.5 
contains  heuristic  methods  for  simplifying  decision  trees,  with  the  aim  of  producing  more 
comprehensible  trees  without  compromising  accuracy  on  unseen  cases  (Quinlan  1993). 
Along with these primary dependent variables, the numbers of Stage2Features and
Stage4Features,  and,  the  actual  features  formed  (along  with  their  size),  were  also 
recorded.  Although  these  secondary  variables  provide  no  direct  measure  of  performance, 
they  are  helpful  in  explaining  feature-construction-time  results. 


Finally, recall we proposed that DUALTREE produces a minimal number of useful
features for adding to the set of primitive attributes from which the features were formed.
We examine our hypotheses in the following way. First, we compare the number of
features formed by DUALTREE to the total number of features we can form using the
set of primitives found in the input decision tree (this set may not be identical to the set
of primitives used for building the tree). If this comparison reveals that the percentage
of total-possible-features formed by DUALTREE is consistently relatively low, then
essentially we can conclude that DUALTREE gives a minimal number of features for
forming a feature-set of 'lower' cardinality, when compared to the cardinality of the
feature-set containing the total number of possible features we can construct using the set
of primitives found in the input tree. Next, we compare the number of new features
appearing in a decision tree produced by C4.5 to the total number of new features we
added to the set of primitive attributes for building the tree. If this comparison shows a
percentage-level of 'new-feature-use' that is consistently relatively high, we infer that
DUALTREE's features generally have higher information gain (i.e., when the gain-ratio
measure is used as the node selection criterion) than the information gain associated with
the primitive attributes. Thus, for our purposes, we may infer that DUALTREE indeed
forms a 'useful' set of features from a given set of primitive attributes.

6.1.1 Experimental Technique

The objective of each series of experiments was to determine the relative
performance of the different values for a particular set of features formed by
DUALTREE, with respect to a given classification problem and an initial decision tree.
To accomplish this, DUALTREE was run using decision trees created from training sets
of various sizes, primarily from two different sources. The first group of training sets
represents three of the four synthetic Boolean domains used by Pagallo (1990): (1) small
random DNF; (2) multiplexor; and (3) parity. These domains provide useful benchmarks
for testing feature-construction algorithms. In fact, Matheus (1989) also used a subset of
functions from these domains to test his CITRE feature construction algorithm. A
detailed description of test functions from these domains is found in Pagallo (1990).

The  sample-sizes  varied  according  to  the  particular  classification  problem.  Pagallo 
(1990)  provides  a  formula  giving  the  approximate  number  of  examples  required  by  an 
ideal  learning  algorithm  that  always  produces  a  consistent  hypothesis  and  only  considers 
hypotheses  that  can  be  expressed  with  at  most  the  number  of  bits  needed  for  the  target 
concept.  Other  sample  complexity  results  are  found  in  Blumer  et  al.  (1987),  Ehrenfeucht 
and  Haussler  (1989),  and  Natarajan  (1991). 

A  second  source  for  the  data-sets  we  tested  using  DUALTREE  is  the  UCI 
Repository  of  Machine  Learning  Databases  (Murphy  and  Aha  1992),  located  at  the 
University of California at Irvine. This repository contains data-sets and domain theories
developed by other researchers that were used to evaluate different learning algorithms.
The data-sets we used include (1) the Monks problem, (2) Mushroom classification, (3)
Tic-tac-toe learning, (4) Consumer Credit Applications, and (5) Glass classification. A
good mixture of nominal and continuous attributes exists for many of these data-sets.


Each  test  consisted  of  running  DUALTREE  using  a  series  of  edges  given  by  C4.5. 

Recall  that  C4.5  requires  a  set  of  examples,  a  set  of  primitives,  and  a  list  of  parameter 

settings  specifying  values  for  the  primitive  attributes.  Also,  the  edges  of  a  binary  tree  are 

necessary  for  DUALTREE  feature  construction.   Hence,  if  necessary,  we  first  'binarize' 

the  data  as  described  in  Chapter  5  before  running  C4.5.    The  dependent  variables  were 

recorded  and  analyzed. 

6.1.2   Presentation  of  Results 

The  experimental  results  are  presented  in  the  following  sections  using  tables,  bar 
graphs  and  other  output  listings  from  DUALTREE.  The  linear  plots  are  two-dimensional 
graphs  having  the  test-concept,  or  function-name  (fn),  plotted  along  the  x  axis.  In  all 
graphs,  a  single  plotted  point  or  bar  represents  a  performance  result,  according  to  certain 
criteria, given by C4.5 and/or DUALTREE. The independent variable is fn's attribute-set.

6.2    Feature Construction Using Binary Data

The  target  concepts  for  the  first  series  of  experiments  were  drawn  from  the  class 
of l-term k-DNF Boolean functions for binary attributes (Pagallo and Haussler 1988;
Matheus  1989;  Pagallo  1990).  Using  this  domain,  we  examine  different  levels  of 
difficulty by simply changing the size of the formula. The DNF functions are dnf1, dnf2,
dnf3, and dnf4.

The multiplexor functions, mx6 and mx11, were used for testing the multiplexor
domain. There exists a multiplexor function defined on a set of k + 2^k attributes or bits
for each positive integer k. We define the function by taking the first k attributes as
address bits and the last 2^k attributes as data bits; thus, the function has the value of the
data bit indexed by the address bits (Wilson 1987; Pagallo 1990). For our test functions, k
equals 2 and 3 respectively.
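The definition above can be sketched in a few lines of Python. This is only an illustration, not the code used in the experiments: the function name multiplexor, the bit order of the address (first address bit most significant), and the placement of irrelevant attributes after the relevant ones are all assumptions.

```python
def multiplexor(bits, k):
    """Value of the multiplexor function with k address bits and 2**k data bits.

    Attributes beyond the first k + 2**k positions are treated as irrelevant
    and ignored (an assumption about how irrelevant attributes are laid out).
    """
    if len(bits) < k + 2 ** k:
        raise ValueError("need at least k + 2**k attributes")
    # Read the first k attributes as a binary address
    # (first bit most significant; this bit order is assumed).
    address = 0
    for b in bits[:k]:
        address = (address << 1) | b
    # The function value is the data bit selected by that address.
    return bits[k + address]


# mx6 corresponds to k = 2 (2 address bits + 4 data bits).
example = [1, 0, 0, 0, 1, 0]      # address bits 1,0 select data bit number 2
value = multiplexor(example, 2)
```

With k = 3 the same function covers mx11 (3 address bits and 8 data bits).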

The final two test functions for examining binary data represent the parity domain
where, for each positive integer k, there exists an even/odd parity function defined on a
set of k attributes or bits. The function has the value true if an even/odd number of
attributes are present (i.e., are 1); otherwise it has the value false. The first k attributes
are used as the parity bits, and in our experiments k equals 4 and 5 for par4 and par5
respectively.
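A minimal sketch of the parity functions follows, under the assumption that the parity bits come first and any remaining attributes are irrelevant. The names parity and count_true_assignments, and the choice of odd parity as the default, are illustrative only.

```python
from itertools import product


def parity(bits, k, odd=True):
    """Parity of the first k attributes; later attributes are ignored."""
    ones = sum(bits[:k])
    is_odd = ones % 2 == 1
    return int(is_odd if odd else not is_odd)


def count_true_assignments(k, odd=True):
    """Number of settings of the k parity bits for which the function is true."""
    return sum(parity(list(assignment), k, odd) for assignment in product([0, 1], repeat=k))


# par4 with 12 irrelevant attributes appended (16 attributes in total).
example = [1, 0, 1, 1] + [0] * 12
value = parity(example, 4)
```

Each true assignment of the k parity bits needs its own term in a DNF description, so the counts returned by count_true_assignments for k = 4 and k = 5 agree with the 8 and 16 terms listed for par4 and par5 in Table 6.1.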

We use a random number generator that gives equally probable 0s and 1s for
setting the value of each attribute or bit. The method for generating random bits is based
on 'primitive polynomials modulo 2', which are special polynomials among those
polynomials whose coefficients are zero or one (Press et al. 1992). From Press et al.
(1992) we find:

Every  primitive  polynomial  modulo  2  of  order  n  defines  a  recurrence  relation  for 
obtaining  a  new  random  bit  from  the  n  preceding  ones.  The  recurrence  relation 
is  guaranteed  to  produce  a  sequence  of  maximal  length,  i.e.,  cycle  through  all 
possible  sequences  of  n  bits  (except  all  zeros)  before  it  repeats.  Therefore  one  can 
seed the sequence with any initial bit pattern (except all zeros), and get 2^n - 1
random bits before the sequence repeats.    (297)

The  interested  reader  is  referred  to  Press  et  al.  (1992)  for  a  description  of  methods  that 

obtain  random  bits  using  a  shift  register  and  a  primitive  polynomial  modulo  2. 
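The recurrence described in the quotation can be sketched as follows. This is a small illustration rather than the routine used for the experiments: the polynomial x^4 + x + 1, the function names lfsr_bits and period, and the tap convention are assumptions chosen to keep the example short.

```python
def lfsr_bits(seed, taps, count):
    """Return `count` bits from the recurrence a[i+n] = XOR of a[i+t] for t in taps.

    seed -- the n starting bits (must not be all zeros)
    taps -- offsets into the current n-bit window; (0, 1) with n = 4
            corresponds to the polynomial x^4 + x + 1 (an illustrative choice)
    """
    state = list(seed)
    if not any(state):
        raise ValueError("seed must contain at least one 1")
    out = []
    for _ in range(count):
        out.append(state[0])            # emit the oldest bit
        new_bit = 0
        for t in taps:
            new_bit ^= state[t]         # addition modulo 2
        state = state[1:] + [new_bit]   # shift the window by one position
    return out


def period(seed, taps):
    """Number of steps before the n-bit window returns to the seed."""
    start = list(seed)
    state = list(seed)
    steps = 0
    while True:
        new_bit = 0
        for t in taps:
            new_bit ^= state[t]
        state = state[1:] + [new_bit]
        steps += 1
        if state == start:
            return steps


bits = lfsr_bits([1, 0, 0, 0], (0, 1), 30)
```

For this order-4 example the window passes through all 15 non-zero patterns before repeating, which is the 2^n - 1 behaviour the quotation describes.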

Table 6.1 gives a concise description of the test functions by showing the
total number of attributes, the number of terms, the length of the shortest and longest
term, and the average term length (Pagallo 1992). The number of irrelevant attributes
present for the functions are 10 and 21 for mx6 and mx11 respectively, and 12 and 27
for par4 and par5. The number of irrelevant attributes varied for the DNF functions,
which are listed in the section that follows. Our convention for denoting the C4.5 output
using features given by DUALTREE for building the tree is to place an 'f' in front of
the function-names. Thus the attribute-set for dnf1 simply consists of the initial primitive
attributes, while the attribute-set for fdnf1 is a set containing the primitive attributes plus
the features formed using DUALTREE.


Table 6.1    Boolean target functions

target                                #                    term length
concept  description     monotone  attributes  terms  shortest  longest  average
dnf1     random DNF      yes           80        9       5        [?]      5.8
dnf2     random DNF      yes           40        8       4        [?]      4.5
dnf3     random DNF      no            32        6       4        [?]      5.5
dnf4     random DNF      no            64       10       3        [?]      4.1
mx6      6-multiplexer   no            16        4       3        [?]      3.0
mx11     11-multiplexer  no            32        8       4        [?]      4.0
par4     4-parity        no            16        8       4        [?]      4.0
par5     5-parity        no            32       16       5        [?]      5.0


For our tests, the value of the function, 0 or 1, represents the class-assignment for
each example of the function. Table 6.2 displays a rather evenly distributed class
distribution for five of the eight test concepts.


Table 6.2   Class distributions for binary data-sets

                       Frequency           Percent
function    size       0        1         0       1
dnf1        3292     2752      540      83.6    16.4
dnf2        2185     1652      533      75.6    24.4
dnf3        1650     1409      241      85.4    14.6
dnf4        2640     1333     1307      50.5    49.5
mx6          720      370      350      51.4    48.6
mx11        1600      810      790      50.6    49.4
par4        1280      635      645      49.6    50.4
par5        4000     2004     1996      50.1    49.9


6.2.1    DNF  Functions  Test  Results 


Following is a description of the random DNFs created by Pagallo (1990) that we
also used as test target concepts:
[Formulas for dnf1, dnf2, dnf3, and dnf4: each is a sum of product terms over the
attributes x1, x2, ..., with primes marking negated attributes in dnf3 and dnf4. The
individual terms are not legible here; the number of terms and the term lengths are
given in Table 6.1, and the functions are described in Pagallo (1990).]


Table 6.3 presents performance results obtained using C4.5 with the features
formed by DUALTREE for each DNF test concept. The table also shows results for
using a 'binarized' form of the 'names' and 'data' file for C4.5. Our convention for
denoting the C4.5 results using a 'binarized' form for the input is to place a 'b' in front
of the function-names. The 'starred' entries ('*') represent an improvement in
performance. Recall that C4.5 uses the gain ratio criterion for selecting a decision
variable, splits the sample until there are two or fewer observations in the split, and
simplifies the tree using a tree-simplifying heuristic.
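As a reminder of what the gain ratio criterion measures, the sketch below computes it for one discrete attribute in its textbook form (information gain divided by split information). It is not C4.5's implementation, which applies further refinements; the function names entropy and gain_ratio are illustrative.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy, in bits, of a sequence of discrete labels."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in Counter(labels).values())


def gain_ratio(values, labels):
    """Information gain of splitting on `values`, divided by the split information."""
    total = len(labels)
    groups = {}
    for v, y in zip(values, labels):
        groups.setdefault(v, []).append(y)
    # Expected class entropy after the split.
    remainder = sum(len(g) / total * entropy(g) for g in groups.values())
    gain = entropy(labels) - remainder
    # Split information is the entropy of the attribute's own value distribution.
    split_info = entropy(values)
    return gain / split_info if split_info > 0 else 0.0


classes = [0, 0, 1, 1]
perfect = gain_ratio([0, 0, 1, 1], classes)   # attribute that separates the classes
useless = gain_ratio([0, 1, 0, 1], classes)   # attribute unrelated to the classes
```

An attribute that separates the classes scores higher than one that carries no class information, which is why a constructed feature must beat the primitives on this measure to be selected at a node.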

Table 6.3   C4.5 results using DUALTREE's features.

                           Before Pruning              After Pruning
fn        # atts  # obs    size   Errors    rank      size   Errors      Est     rank
dnf1         80    3292     467   55(1.7%)    5        287   122(3.7%)    9.7%     4
fdnf1       156    3292     531   75(2.3%)    5        305   149(4.5%)   10.8%     4
bdnf1       160    3292     465   56(1.7%)    5        287   122(3.7%)    9.7%     4
fbdnf1      267    3292     521   63(1.9%)    5        257*  142(4.3%)    9.7%     3*
dnf2         40    2185     301   39(1.8%)    4        165    89(4.1%)    9.5%     4
fdnf2        83    2185     445   54(2.5%)    4        275   112(5.1%)   13.5%     3*
bdnf2        80    2185     301   39(1.8%)    4        165    89(4.1%)    9.5%     4
fbdnf2      139    2185     335   39(1.8%)    4        231    64(2.9%)*   9.9%     4
dnf3         32    1650     137   19(1.2%)    3         41    48(2.9%)    5.1%     2
fdnf3        73    1650     133*  20(1.2%)    4         41    52(3.2%)    5.3%     3
bdnf3        64    1650     137   19(1.2%)    3         41    48(2.9%)    5.1%     2
fbdnf3       97    1650     153   20(1.2%)    3         31*   62(3.8%)    5.5%     2
dnf4         64    2640     593   77(2.9%)    6        527    99(3.8%)   16.2%     6
fdnf4       145    2640     673   79(3.0%)    6        553   113(4.3%)   17.2%     6
bdnf4       128    2640     593   76(2.9%)    6        533    96(3.6%)   16.1%     6
fbdnf4      182    2640     697   82(3.1%)    6        563   125(4.7%)   17.8%     5*

'f'<func-name> ==> C4.5 using features formed with Initial Tree
'b'<func-name> ==> C4.5 uses 'Binarized' files
BOLDFACE* ==> Improvement in performance


Essentially, Table 6.3 shows that using features formed by DUALTREE results in
performance comparable to a decision tree given by C4.5. The size of the initial tree,
after adding the new features, was smaller only for dnf3 and fdnf3, a reduction of four
nodes. In the worst case, fdnf2, we increased the size of the tree for dnf2 by 144 nodes.
This case also exhibits the worst performance for the misclassification rate, an increase
of 0.7 percentage points, and the predictive accuracy, an increase of 4.0 percentage
points. However, for this case, observe that we have produced a simplified tree (i.e., the
tree after pruning) with a smaller rank than the simplified tree constructed without using
DUALTREE's features. This was one of only a few instances where we decreased the
rank of a tree. Considering the initial trees (i.e., the trees before pruning), none of the
runs resulted in building a tree having a smaller rank. In fact, for dnf3 and fdnf3, even
though we built a smaller tree having essentially the same misclassification rate, the rank
for the tree is higher than the rank of the tree built from the primitives.

Next, observe that the performance results when using a 'binarized' form of the
initial input files for C4.5 are not always identical to the results associated with using the
initial input files (e.g., the error-rates and estimates for dnf4 and bdnf4 are not equal).
This was a rather unexpected result. However, the action of 'binarizing' the data did not
function to degrade performance in any of the runs. In fact, we can say that in some
cases, 'binarizing' the data may actually improve performance. For example, consider
dnf2 and bdnf2. The sizes, error-rates, and ranks are identical, as we expect them to be.
Using dnf2, DUALTREE added 43 features to the set of 40 primitives, which doubled the
size of the feature-set (recall that the running time for C4.5 fundamentally depends on the
total number of attributes we test for choosing the one giving the best 'split-information');
whereas, when using bdnf2, DUALTREE formed 59 features, which is less than the
number of initial 'binarized' primitives. Thus, we do not double the size of the
feature-set for this case. Observe also that the tree built using these 59 features
outperforms the tree created using the initial primitives along with the 43 features in the
tree's feature-set.

Finally,  we  examine  the  data  to  see  if  DUALTREE  forms  a  'minimal'  number  of 
'useful'  features.  The  next  section  examines  the  usefulness  of  DUALTREE's  features  in 
terms  of  their  likelihood  of  being  selected  using  the  gain-ratio  measure.  We  then  describe 
our  experimental  results  for  test-functions  representing  the  multiplexor  and  parity 
domains.

6.2.1.1    Useful features of DNF functions

Table 6.4 shows feature-formation results for the features formed by DUALTREE
for our test functions representing the DNF domain. The table also shows 'feature-usage'
results for a decision tree constructed using an attribute-set containing DUALTREE's
features. Note that we can form at most 2^n features, where n is the number of primitive
attributes used in the input decision tree. Clearly, for all runs, the proportion of features
formed to the total number of features possible is less than one half of one percent.
Hence, for DNF functions, DUALTREE indeed forms a 'minimal' number of finite
features, and feature-sets containing DUALTREE's features are of lower cardinality than
feature-sets containing all possible features. This is key since the execution time of C4.5
depends, among other parameters, on the number of features in the attribute-set and on
the overall number of tests. Attribute-sets having a large number of features slow down
the computation time of C4.5, since every example is evaluated at each feature, thus
leading to a large number of tests, and the algorithm has to rank, at each node, all the
tests in order to select the best one.


Table 6.4   Feature-formation results for DNF functions

          # primes   # features   # features   max size   max size   # edges
fn        used       formed       used         formed     used       with feature
fdnf1        75          76           44           41         11          250
fbdnf1      101         107           64           22         10          250
fdnf2        40          43           26           27          8          123
fbdnf2       55          59           33           17          9          146
fdnf3        29          41           11           17          7           43
fbdnf3       34          33           12           14          8           48
fdnf4        64          81           45           26          9          358
fbdnf4      101          54           38           13          8          278


[Bar graph titled "Feature Usage": number of features formed and number of features
used, plotted by fn for fdnf1, fbdnf1, fdnf2, fbdnf2, fdnf3, fbdnf3, fdnf4, and fbdnf4.]

Figure 6.1    Comparison of features used to features formed


Figure 6.1 illustrates 'feature-usage' results by comparing the number of features
formed by DUALTREE to the number of features actually used in the decision tree
constructed from an attribute-set containing the features, for each test function and its
'binarized' form. This feature-usage percentage, given by the ratio of the number of
features used to features formed, ranges from a high of 70.4% for fbdnf4 to a low of
26.8% for fdnf3. Thus we can say that, for the DNF functions, a significant number of
the minimal amount of features formed by DUALTREE are chosen by C4.5's
decision-tree heuristics. There are several reasons why C4.5 may not choose a feature for
a node. First, a feature may not be chosen because it is irrelevant to the target concept.
Since we used irrelevant attributes, we may in fact form irrelevant features. Second, a
feature may not be chosen because it has become part of a more useful feature. Finally, a
feature may not be used because, given a set of features, there are equivalent ways to
represent a concept by a decision tree. As an example, consider an attribute-set containing
the primitive attributes x1, x2 and x3, along with the features x1x2 and x2x3. Assume that
we want to represent the concept x1x2x3 by a decision tree using this attribute-set. A tree
using the features x1x2 and x3, and a tree with features x1 and x2x3, represent equivalent
descriptions for the concept. In the first case, the features x2x3, x1 and x2 are not used;
while in the second case, we do not use x2, x3 and x1x2.
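The equivalence claimed in the x1x2x3 example can be checked exhaustively. The sketch below writes each of the two trees as nested tests and compares them with the target concept on all eight assignments; the function names are illustrative and the trees are hand-written, not output from C4.5.

```python
from itertools import product


def target(x1, x2, x3):
    """The concept x1x2x3."""
    return int(x1 and x2 and x3)


def tree_a(x1, x2, x3):
    """Tree that tests the feature x1x2 at the root, then the primitive x3."""
    f12 = x1 and x2
    if f12:
        return 1 if x3 else 0
    return 0


def tree_b(x1, x2, x3):
    """Tree that tests the primitive x1 at the root, then the feature x2x3."""
    f23 = x2 and x3
    if x1:
        return 1 if f23 else 0
    return 0


assignments = list(product([0, 1], repeat=3))
trees_agree = all(
    tree_a(*a) == tree_b(*a) == target(*a) for a in assignments
)
```

Both trees classify every assignment identically, so whichever one the tree builder produces, some of the five available attributes necessarily go unused.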

Figure 6.2 depicts 'feature-usage' results based on the number of edges in a tree
and the number of those edges that have a new feature as a predecessor or successor node.
The percentage given by the number of edges having a feature to the number of edges
in the tree varies from 32% for fbdnf3 to 53% for fdnf4. Thus, we can say that a
significant number of edges in a tree formed by C4.5 for these test functions will contain
features formed by DUALTREE. This information strengthens our claim that the features
formed by DUALTREE are likely to be used by C4.5 when building decision trees.

Finally, Table 6.4 also gives results based on feature sizes. The size of a feature
is simply the number of primitive attributes comprising it. Table 6.4 shows the size of
the largest feature formed by DUALTREE, and the size of the largest feature used by
C4.5 for building a tree, for each test function. In most of the runs DUALTREE formed
features representing terms for the concept that may 'over generalize' the concept, since
the maximum size of the features used tends to be much smaller than the maximum size
of the features formed. Because a meaningful number of DUALTREE's features are used
by C4.5, we do not further examine this result in this research. This information,
however, may suggest a need for using a feature pruning method, or some form of bias
that favors smaller features over larger ones.


[Bar graph titled "Feature Usage": number of edges in the tree and number of edges with
a feature, plotted by fn for fdnf1, fbdnf1, fdnf2, fbdnf2, fdnf3, fbdnf3, fdnf4, and fbdnf4.]

Figure 6.2   Comparison of edges with features to total edges


6.2.2    Multiplexor  and  Parity  Functions  Test  Results 




This section describes our experimental results for using DUALTREE in
conjunction with mx6 and mx11, representing the multiplexor domain, and par4 and
par5, representing the parity domain. Overall, the performance results are more favorable
than the results for the DNF functions we tested. The data suggest that DUALTREE
forms features that aid in simplifying decision trees.

Table 6.5 displays performance results acquired using C4.5 with DUALTREE's
features for the multiplexor and parity test concepts, along with the corresponding results
for using a binarized form of the input data. We see that, in general, using features
formed by DUALTREE results in improved performance for a decision tree given by
C4.5. After adding DUALTREE's features, we decreased the size of the trees in most
of the runs. In the worst case, fbpar5, we increased the size of the initial tree for bpar5
by 210 nodes. However, note that this tree has both a lower misclassification rate and
smaller rank. The rate falls by 0.4 percentage points and the rank is reduced by 3. This
run, fbpar5, also exhibits the worst-case performance for the predictive accuracy of the
trees, an increase of 1.5 percentage points, but note that the tree has a rank of 5 as
opposed to 8 for bpar5.

The  tree  for  fbmx6  also  has  16  more  nodes  than  the  tree  representing  bmx6.  In 
this  case,  the  rank  remained  unchanged  and  the  misclassification  rate  increased  by  0.6 
percentage  points.  However,  also  observe  for  this  case  that,  for  the  simplified  tree, 
DUALTREE's  features  resulted  in  a  smaller  tree  both  in  terms  of  its  size  and  rank.  Note 
that  this  case  also  exhibits  the  worst-case  performance  for  the  misclassification  rate— an 
increase  of  0.6  percentage  points. 

Table 6.5    C4.5 results using multiplexor and parity functions

                           Before Pruning                After Pruning
fn        # atts  # obs    size   Errors       rank     size     Errors        Est      rank
mx6          16     720      43   0(0.0%)        3        43     0(0.0%)       4.1%      3
fmx6         19     720      37*  0(0.0%)        3        37*    0(0.0%)       3.6%      3
bmx6         32     720      43   0(0.0%)        3        43     0(0.0%)       4.1%      3
fbmx6        41     720      59   4(0.6%)        3        23*    16(2.2%)      4.7%      2*
mx11         32    1600     185   2(0.1%)        5       185     2(0.1%)       7.6%      5
fmx11        49    1600     167*  4(0.2%)        4*      101*    5(0.3%)       4.6%*     4*
bmx11        64    1600     185   2(0.1%)        5       185     2(0.1%)       7.6%      5
fbmx11       91    1600     139*  3(0.2%)        3*       71*    0(0.0%)*      3.1%*     3*
par4         16    1280     591   168(13.1%)     7       395     227(17.7%)   37.4%      6
fpar4        34    1280     573*  114(8.9%)*     5*      113*    203(15.9%)*  22.9%      4*
bpar4        32    1280     591   168(13.1%)     7       395     227(17.7%)   37.4%      6
fbpar4       72    1280     555*  100(7.8%)*     5*       31*    73(5.7%)*     7.7%*     3*
par5         32    4000    1705   291(7.3%)      8       145[?]  378(9.4%)    31.3%     [?]
fpar5        78    4000    1805   279(7.0%)*     6*     1415     393(9.8%)    31.3%      6*
bpar5        64    4000    1711   289(7.2%)      8      1473     371(9.3%)    31.4%      8
fbpar5      121    4000    1921   272(6.8%)*     5*     1495     416(10.4%)   32.9%      5*

'f'<func-name> ==> C4.5 using features formed with Initial Tree
'b'<func-name> ==> C4.5 uses 'Binarized' files
BOLDFACE* ==> Improvement in performance


The best performance, in terms of size-reductions for the trees, is illustrated by
fbmx11, which is 46 nodes smaller than bmx11, or a 25.4% reduction. Note also that we
improved the performance of the estimate by 4.5 percentage points. The best
performance with regard to the misclassification rate was given by fbpar4, an
improvement of 5.3 percentage points over the rate for bpar4. This case, fbpar4, also
exhibits the best-case performance for the estimate, a drop of 29.7 percentage points from
the estimate associated with bpar4. Fbpar5 performed best, in terms of the rank of the
initial trees, since the rank of its tree is 3 less than the rank of the tree associated with
bpar5. Observe that the rank did not go up for any of the initial trees. In fact,
DUALTREE's features led to trees having a smaller rank in most of the cases. Next, we
discuss the 'feature-usage' results for these domains.



6.2.2.1    Useful  features  of  multiplexor  and  parity  functions 

Given  the  previous  discussion,  it  appears  that  DUALTREE  works  well  with 
functions  representing  the  parity  domain,  and  also  gives  favorable  results  when  using 
functions  from  the  multiplexor  domain.  In  this  section  we  show  that  the  'feature-usage' 
results  for  these  domains  are  comparable  to  the  results  we  obtained  using  functions 
representing  the  DNF  domain. 

Table  6.6  illustrates  feature-formation  results  for  the  features  formed  by 
DUALTREE  using  functions  representing  the  multiplexor  and  parity  domains.  Observe 
that,  for  all  runs,  the  proportion  of  features  formed  to  the  total  number  of  possible 
features  is  again  less  than  one  half  of  one  percent,  except  for  fmx6.  In  this  case, 
DUALTREE  formed  4.7%  of  the  total  number  of  features  we  can  construct  using  the  6 
primitives.    This,  however,  is  still  a  minimal  amount  of  features. 
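The proportions quoted above follow directly from the counts in Table 6.6. The sketch below recomputes them; the function name formed_share and the dictionary layout are illustrative, and the numbers are simply the "# primes used" and "# features formed" columns of the table.

```python
def formed_share(n_primitives, n_formed):
    """Percentage of the 2**n possible features that were actually formed."""
    return 100.0 * n_formed / 2 ** n_primitives


# (primitives used, features formed), as listed in Table 6.6.
TABLE_6_6 = {
    "fmx6": (6, 3),
    "fbmx6": (11, 9),
    "fmx11": (14, 17),
    "fbmx11": (21, 27),
    "fpar4": (16, 18),
    "fbpar4": (30, 40),
    "fpar5": (32, 46),
    "fbpar5": (62, 57),
}

shares = {fn: formed_share(n, formed) for fn, (n, formed) in TABLE_6_6.items()}
```

Only fmx6, which uses just 6 primitives and therefore has only 64 possible features, rises above half of one percent.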

Observe  also  that  the  'over  generalization'  idea  does  not  lend  itself  well  to  this  set 
of  runs  because  the  maximum  sizes  of  features  used  in  the  trees  are  not  very  different 
from  the  maximum  sizes  of  the  features  formed.  In  fact,  in  several  of  the  cases,  these 
sizes  are  identical. 

Table 6.6   Multiplexor and parity feature-formation results

          # primes   # features   # features   max size   max size   # edges
fn        used       formed       used         formed     used       with feature
fmx6          6           3            3            2          2           10
fbmx6        11           9            4            6          3           13
fmx11        14          17            9            6          6           26
fbmx11       21          27           13            8          8           32
fpar4        16          18           18            6          6          147
fbpar4       30          40           33            9          8          211
fpar5        32          46           36           15         11          504
fbpar5       62          57           52           17         12          966


Figure 6.3 illustrates 'feature-usage' results by depicting the number of features
formed and used for each test concept. Consider the ratio of the number of
DUALTREE's features used in a decision tree to the total number of features that
DUALTREE formed for each of the runs. We see that this ratio ranges from 100% for
fmx6 and fpar4, to 44% for fbmx6. Hence, we can say that for functions representing the
multiplexor and parity domains, a significant number of the minimal amount of features
formed by DUALTREE are chosen by C4.5's decision-tree heuristics.
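The feature-usage percentages quoted for both groups of test functions can be recomputed from the "# features formed" and "# features used" columns of Tables 6.4 and 6.6. The sketch below does so; the names usage_percent, DNF_RUNS, and MX_PAR_RUNS are illustrative.

```python
def usage_percent(formed, used):
    """Features used as a percentage of features formed."""
    return 100.0 * used / formed


# (features formed, features used), from Table 6.4.
DNF_RUNS = {
    "fdnf1": (76, 44), "fbdnf1": (107, 64),
    "fdnf2": (43, 26), "fbdnf2": (59, 33),
    "fdnf3": (41, 11), "fbdnf3": (33, 12),
    "fdnf4": (81, 45), "fbdnf4": (54, 38),
}

# (features formed, features used), from Table 6.6.
MX_PAR_RUNS = {
    "fmx6": (3, 3), "fbmx6": (9, 4),
    "fmx11": (17, 9), "fbmx11": (27, 13),
    "fpar4": (18, 18), "fbpar4": (40, 33),
    "fpar5": (46, 36), "fbpar5": (57, 52),
}

dnf_usage = {fn: usage_percent(*counts) for fn, counts in DNF_RUNS.items()}
mx_par_usage = {fn: usage_percent(*counts) for fn, counts in MX_PAR_RUNS.items()}
```

Sorting either dictionary by value gives the ranges cited in the text: fbdnf4 and fdnf3 bound the DNF runs, while fmx6 and fpar4 (every formed feature used) and fbmx6 bound the multiplexor and parity runs.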

Finally, Figure 6.4 shows 'feature-usage' results based on the number of edges in
a tree. We see that the percentage of edges in a tree that contain DUALTREE
features ranges from 50.3% for fbpar5 to 15.7% for fmx11. This information further
strengthens our claim that the features formed by DUALTREE are likely to be used by
C4.5 when building decision trees.

[Bar graph titled "Feature Usage": number of features formed and number of features
used (vertical axis "features", 0 to 65), plotted by fn for fmx6, fbmx6, fmx11, fbmx11,
fpar4, fbpar4, fpar5, and fbpar5.]

Figure 6.3    Feature formation-rates of multiplexor and parity functions


[Bar graph titled "Feature Usage": number of edges in the tree and number of edges with
a feature (vertical axis "edges", 0 to 2000), plotted by fn for fmx6, fbmx6, fmx11,
fbmx11, fpar4, fbpar4, fpar5, and fbpar5.]

Figure 6.4   Edge-usage results using multiplexor and parity functions


6.2.3    Summary of Results for Binary Data


We conclude our study of DUALTREE's performance using binary data and C4.5
by summarizing our test results for functions representing the DNF, multiplexor, and
parity domains. We omit other characteristics of DUALTREE feature construction, but
in the final chapter we suggest other aspects of the method as areas of future study. For
example, our criteria for useful features are based on a specific 'goodness of split'
measure used to build decision trees. An interesting study would be to redefine this
'usefulness' criterion and test DUALTREE's features using a new criterion such as, for
DNF concepts, an adequate feature, which is simply a term of the DNF concept. In many
of the experiments, the tree constructed using DUALTREE features matched or
outperformed the tree constructed using the primitive attributes. This is adequate for our
purposes since our fundamental goal is to see if DUALTREE's features represent
plausible descriptions of consumer life-styles and expectations. Other researchers who
have examined the performance of feature construction algorithms, using different
approaches, report results based on the 'sample size' or 'number of iterations' used by the
algorithm (Matheus 1989; Pagallo 1990). Our results are primarily based on examining
two decision trees: one before and one after DUALTREE's features are added to a given
attribute-set. In our approach, the number of new features constructed is relative to a
given decision tree: its size and the number of attributes used in the tree. A big problem
in feature construction is constraining the growth of the space of potential features.
DUALTREE forms features following a 'formal theory', minimization of finite automata,
instead of using a heuristic approach such as rules for selecting nodes in a tree to use for
forming the features based on their relative positions in the tree. We do not explore
making a comparative analysis between DUALTREE and other feature-construction
algorithms since DUALTREE's method symbolizes a new approach for forming features.
However, our approach does not produce a 'forest' of decision trees when forming
features, as do other methods. With respect to the running time required for forming the
features, our approach is a computationally efficient way to form a finite number of
features.

[Four line graphs, each comparing values before and after feature construction (FC) for
the sixteen runs f1, bf1, f2, bf2, f3, bf3, f4, bf4, 6, b6, 11, b11, r4, br4, r5, and br5:
"C4.5 Initial Tree Sizes", "C4.5 Initial Tree Ranks", "C4.5 Misclassification Rates", and
"C4.5 Tree Estimates".]

Figure 6.5    Performance results for binary data-sets


Figure 6.5 shows performance results for the Boolean functions we tested using
C4.5. Using two trees, one before and one after adding DUALTREE's features, the figure
shows: (1) sizes of the initial trees, (2) ranks for the initial trees, (3) misclassification
rates for the initial trees, and (4) the predictive accuracy or estimates for the trees. For
clarity in the figure we describe the function-names compactly. For example, f1 and bf1
represent dnf1 and bdnf1, respectively, 6 and b6 represent mx6 and bmx6, and r4 and br4
represent par4 and bpar4. We have sixteen data points comparing two decision trees.
Note that the size and rank of a tree are interrelated. For example, given two trees where
the second tree was constructed using a number of new features, suppose the second has
fewer nodes than the first and a rank equal to the rank of the first. This represents an
improvement in performance, according to certain criteria. Now suppose the second tree
has more nodes than the first, but the rank of the second is lower than the rank of the
first. This then also represents an improvement in performance. If the size and the rank
of the second tree are both higher, then performance has been degraded.

We  observe  from  the  figure  that  the  DNF  functions  exhibited  the  poorest 
performance  overall.  One  explanation  for  this  result  may  be  that  even  though  the  features 
formed  contain  enough  classification  information  to  be  selected  for  inclusion  in  a  tree, 
they  are  not  the  most  useful  features  overall.  However,  the  results  displayed  in 
Figure 6.6 (showing the number of features formed, the number of features used, and the
number of edges in a tree containing a feature) suggest that, overall, DUALTREE's
features  are  useful  for  our  purposes. 

In  general,  from  our  artificial  data  sets,  we  conclude  that  DUALTREE's  features 
simplify  decision  trees  without  sacrificing  the  accuracy  of  the  tree  to  any  large  degree. 
Also,  in  some  cases,  DUALTREE's  features  may  even  improve  a  tree's  performance  for 
classifying  unseen  examples  that  were  not  used  to  build  the  tree.  Next,  we  examine 
DUALTREE's  performance  using  data  sets  containing  nominal  and  continuous  data.  This 
is followed by a discussion on the representativeness of DUALTREE's features for
descriptions  of  consumer  life-styles  and  expectations. 

6.3    Feature  Construction  Using  Nominal  Data 

This section gives results acquired from testing other domains represented by
several data-sets found in the UCI Repository of Machine Learning Databases (Murphy
and Aha 1992). We studied the performance of decision trees when nominal attributes
are used to form features, using six data-sets from the repository. These data-sets
represent three domains: (1) MONK's Problems, a set of three artificial domains over the
same attribute space; (2) Mushrooms, described in terms of their physical characteristics
and classified as edible or poisonous; and, (3) Tic-Tac-Toe Endgame, which has a binary
classification task: win or lose for x.

[Figure 6.6  Tests using DUALTREE features. Two panels over the same sixteen
function-names: Feature Use in C4.5 Trees; Feature Use in C4.5 Edges.]

Since  full  descriptions  of  these  data-sets,  along  with  past  usage  and  donor 
information,  reside  in  the  repository  (Murphy  and  Aha  1992),  we  describe  them  only 
briefly. 


The MONK's problems were the basis of a first international comparison of
learning algorithms performed by a collection of researchers. We examine three MONK's
problems in our experiments: monks1, monks2, and monks3. Monks3 also has 5% class
noise added to the training set. There are 432 instances for each problem, and we want
to classify each example into one of two classes. The domains for all MONK's problems
are the same and consist of six nominally valued attributes and an ID number. There are
no missing attribute values and the following information appears in Murphy and Aha
(1992):

Number of Instances: 432

Attribute information:
    1. class:  0, 1
    2. a1:     1, 2, 3
    3. a2:     1, 2, 3
    4. a3:     1, 2
    5. a4:     1, 2, 3
    6. a5:     1, 2, 3, 4
    7. a6:     1, 2
    8. Id:     (A unique symbol for each instance)

Target Concepts associated to the MONK's problem:
    MONK-1:  (a1 = a2) or (a5 = 1)
    MONK-2:  EXACTLY TWO of {a1 = 1, a2 = 1, a3 = 1, a4 = 1, a5 = 1, a6 = 1}
    MONK-3:  (a5 = 3 and a4 = 1) or (a5 /= 4 and a2 /= 3)
             (5% class noise added to the training set)
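
The three target concepts above can be written directly as predicates over the attribute space. The following is a minimal sketch, not part of our experimental code: the function and variable names are hypothetical, and the 5% class noise of MONK-3 is deliberately not modelled. It enumerates all 432 attribute combinations and counts how many satisfy each noise-free concept.

```python
from itertools import product

# Value domains for a1..a6, taken from the attribute information above.
DOMAINS = [(1, 2, 3), (1, 2, 3), (1, 2), (1, 2, 3), (1, 2, 3, 4), (1, 2)]

def monk1(a1, a2, a3, a4, a5, a6):
    # MONK-1: (a1 = a2) or (a5 = 1)
    return a1 == a2 or a5 == 1

def monk2(a1, a2, a3, a4, a5, a6):
    # MONK-2: exactly two of the six attributes take the value 1
    return sum(v == 1 for v in (a1, a2, a3, a4, a5, a6)) == 2

def monk3(a1, a2, a3, a4, a5, a6):
    # MONK-3 without the class noise: (a5 = 3 and a4 = 1) or (a5 /= 4 and a2 /= 3)
    return (a5 == 3 and a4 == 1) or (a5 != 4 and a2 != 3)

# Every combination of attribute values: 3 * 3 * 2 * 3 * 4 * 2 instances.
INSTANCES = list(product(*DOMAINS))

def positives(concept):
    """Number of instances for which the concept holds."""
    return sum(1 for instance in INSTANCES if concept(*instance))

if __name__ == "__main__":
    print(len(INSTANCES), positives(monk1), positives(monk2), positives(monk3))
```

Because the attribute space is fully enumerated, the instance count agrees with the 432 instances quoted from the repository.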

Mushroom depicts a database of records drawn from the Audubon Society Field
Guide to North American Mushrooms, describing hypothetical samples corresponding to
23 species of gilled mushrooms in the Agaricus and Lepiota Family. The data-set has
8124 instances and 22 nominally valued attributes; 2480 instances have missing attribute
values, all of which occur for only one attribute. There are two classes: edible and
poisonous. The distribution for the classes is:

    edible:       4208   (51.8%)
    poisonous:    3916   (48.2%)
    total:        8124

Additionally, the Audubon Society Field Guide states that there is no simple rule for
determining the edibility of a mushroom (Murphy and Aha 1992).

Tictactoe  represents  a  database  encoding  the  complete  set  of  possible  board 
configurations  at  the  end  of  tic-tac-toe  games,  where  'x'  is  assumed  to  have  played  first. 
There  are  eight  possible  ways  to  create  a  'win  for  x'.  This  data-set  has  958  instances  and 
9  attributes— each  corresponding  to  one  tic-tac-toe  square.  There  are  no  missing  attribute 
values  and  2  classes.  Considering  the  class  distribution,  about  65.3%  of  the  examples  are 
positive. 
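
The description above can be checked by enumerating the game tree. The sketch below is an illustration under our own assumptions, not the procedure used to build the repository file: the board encoding (a 9-character string with "b" for a blank square) and all function names are hypothetical. It plays out every game with 'x' moving first, stops as soon as a player completes one of the eight winning lines or the board is full, and collects the distinct final boards.

```python
# The eight ways to complete three in a row on a 3x3 board (squares 0..8).
WIN_LINES = [
    (0, 1, 2), (3, 4, 5), (6, 7, 8),   # rows
    (0, 3, 6), (1, 4, 7), (2, 5, 8),   # columns
    (0, 4, 8), (2, 4, 6),              # diagonals
]

def winner(board):
    """Return 'x' or 'o' if that player holds a complete line, else None."""
    for i, j, k in WIN_LINES:
        if board[i] != "b" and board[i] == board[j] == board[k]:
            return board[i]
    return None

def endgame_boards():
    """Collect every distinct board at which a game ends, with 'x' moving first."""
    finished, seen = set(), set()
    stack = [("b" * 9, "x")]
    while stack:
        board, player = stack.pop()
        if board in seen:
            continue
        seen.add(board)
        if winner(board) is not None or "b" not in board:
            finished.add(board)          # game over: win or full board
            continue
        next_player = "o" if player == "x" else "x"
        for square, mark in enumerate(board):
            if mark == "b":
                stack.append((board[:square] + player + board[square + 1:], next_player))
    return finished

if __name__ == "__main__":
    boards = endgame_boards()
    x_wins = sum(1 for b in boards if winner(b) == "x")
    print(len(boards), x_wins, round(100 * x_wins / len(boards), 1))
```

The enumeration reproduces the size of the data-set and the share of positive ('win for x') examples reported above.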

6.3.1    Test  Results  using  Nominal  Data 

Nominal attributes may have more than two discrete attribute-values, instead of
exactly two like binary attributes. This is a primary reason why the rank of a decision
tree  constructed  using  nominal  attributes  is  undefined.  Thus,  this  dependent  variable  is 
not  available  for  the  initial  trees  constructed  using  the  nominal  attributes.  Instead,  we 
examine  this  parameter  by  focusing  on  the  rank  of  the  trees  constructed  using  a  binarized 
form  of  the  nominal  data-sets.  Recall  that  'binarizing'  the  nominal  data-sets  is  required 
for  producing  the  binary  decision  tree  required  for  DUALTREE  feature  construction. 


Table 6.7  C4.5 results using nominal data-sets

                                Before Pruning               After Pruning
    fn           # atts  # obs  size  Errors      rank    size  Errors       Est     rank
    monks1          6     432     41  0(0.0%)               41  0(0.0%)       8.5%
    bmonks1        17              7  72(16.7%)    1         7  72(16.7%)    18.9%    1
    fbmonks1       19              5* 72(16.7%)    1         5* 72(16.7%)    18.6%*   1
    monks2          6     432    157  78(18.1%)              1  142(32.9%)   34.6%
    bmonks2        17            113  0(0.0%)      4        89  0(0.0%)      13.0%    3
    fbmonks2       22            111* 6(1.4%)      3*       99  10(2.3%)     16.7%    3
    monks3          6     432     19  0(0.0%)               19  0(0.0%)       4.4%
    bmonks3        17              9  0(0.0%)      1         9  0(0.0%)       1.6%    1
    fbmonks3       19              7* 0(0.0%)      1         7* 0(0.0%)       1.?%*   1
    mushroom       22    8101     30  0(0.0%)               30  0(0.0%)       0.3%
    bmushroom     125             23  0(0.0%)      2        17  0(0.0%)       0.2%    2
    fbmushroom    134             23  0(0.0%)      2        17  0(0.0%)       0.2%    2
    tictactoe       9     958    208  42(4.4%)             142  60(6.3%)     18.1%
    btictactoe     27            101  14(1.5%)     4        77  23(2.4%)      7.9%    4
    fbtictactoe    56            143  15(1.6%)     3*       65* 31(3.2%)      8.1%    3*

    'f'<func-name>  =>  C4.5 using features formed with Initial Tree
    'b'<func-name>  =>  C4.5 uses 'Binarized' files
    BOLDFACE*       =>  'Improvement' in performance


Table 6.7 shows performance results for each of the data-sets tested, along with
their 'binarized' forms. Again we see that, in most cases, DUALTREE's features give
improved performance for decision trees constructed using them. Note that for the
mushroom data-set, the final tree, fbmushroom, contains no features formed by
DUALTREE. This suggests that, in general, the primitive attributes have higher
information gain than the new features do. To a certain extent, this result
is not very surprising. Recall that a feature is simply a conjunct of attributes. Each term
of a DNF concept can be regarded as a feature, as well as a rule for determining
concept-membership. Thus, we may also describe a concept using a set of rules. Since no
simple rule exists for determining the edibility of a mushroom, this suggests that the
concept is more appropriately described using solely the primitive attributes.
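
To make the comparison between a primitive attribute and a conjunctive feature concrete, the following sketch computes information gain for both on a four-example toy sample. The data and the function names are hypothetical and chosen only for illustration. C4.5 itself normally ranks tests by gain ratio; plain information gain is used here because it is the quantity named in the discussion above.

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())

def information_gain(values, labels):
    """Reduction in entropy obtained by splitting the sample on `values`."""
    total = len(labels)
    remainder = 0.0
    for v in set(values):
        subset = [y for x, y in zip(values, labels) if x == v]
        remainder += len(subset) / total * entropy(subset)
    return entropy(labels) - remainder

def conjunct(*columns):
    """Form a feature as the conjunction of binary attribute columns."""
    return [int(all(bits)) for bits in zip(*columns)]

# Hypothetical toy sample: the class is 1 only when both attributes are 1.
a = [0, 0, 1, 1]
b = [0, 1, 0, 1]
labels = [0, 0, 0, 1]
feature = conjunct(a, b)

if __name__ == "__main__":
    print(information_gain(a, labels), information_gain(feature, labels))
```

In this toy case the conjunct separates the classes completely, so its gain equals the entropy of the sample and exceeds the gain of either primitive. When no such conjunct exists for a concept, as the mushroom result suggests, the primitives are selected instead.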

From Table 6.7 observe that the size of the initial tree after adding the new
features was larger only for btictactoe and fbtictactoe, an increase of 42 nodes. Note that
fbtictactoe also has a lower rank than the tree represented by btictactoe. The worst-case
performance for the misclassification rate was given by bmonks2 and fbmonks2, an
increase of 1.4 percentage points. However, note that we improved performance in this
case with regard to the size and rank of the initial trees. This case also exhibits the
worst-case performance for the estimate, an increase of 3.7 percentage points.

Table 6.8 gives feature-formation and 'feature-usage' results for a decision tree
constructed using an attribute-set containing DUALTREE features. The proportion of
features formed to the total number of features possible is 25%, 0.49%, and 12.5% for
fbmonks1, fbmonks2, and fbmonks3, respectively. For fbmushroom and fbtictactoe, this
proportion is 0.44%, or less than one half of one percent. Because the higher proportions
are associated with cases where the number of primes used is fairly small, we do not
attach any negative significance to this result that would affect our earlier conclusion
that DUALTREE forms a minimal, finite number of features.
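
The proportions quoted above can be reproduced with a one-line calculation. The sketch below assumes that the total number of possible features is 2^p, where p is the number of primes used. This denominator is inferred from the reported percentages and the counts in Table 6.8 rather than stated in this section, and the function name is hypothetical.

```python
def proportion_formed(features_formed, primes_used):
    """Percentage of the 2**primes_used possible features that were formed.

    The 2**p denominator is an assumption inferred from the reported figures.
    """
    return 100.0 * features_formed / 2 ** primes_used

if __name__ == "__main__":
    # (features formed, primes used) as listed in Table 6.8
    for name, formed, primes in [("fbmonks1", 2, 3), ("fbmonks2", 5, 10),
                                 ("fbmonks3", 2, 4), ("fbmushroom", 9, 11)]:
        print(name, round(proportion_formed(formed, primes), 2))
```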

For each of the test cases, the percentage of features formed that were used in the
tree is:

    fn              %
    fbmonks1       50
    fbmonks2       80
    fbmonks3       50
    fbmushroom      0.0
    fbtictactoe    41.4

Table 6.8  Feature-formation results using nominal data-sets

                  # primes  # features  # features  max size  max size  # edges
    fn            used      formed      used        formed    used      with feature
    fbmonks1          3          2           1          2         2           3
    fbmonks2         10          5           4          3         2           8
    fbmonks3          4          2           1          2         2           3
    fbmushroom       11          9           0          5         0           0
    fbtictactoe      24         29          12          7         4          32

The percentage of edges in the tree with a feature is:

    fn              %
    fbmonks1       75
    fbmonks2        7.3
    fbmonks3       50
    fbtictactoe    22.5

Taking into account the prior information concerning the existence of simple rules
describing mushroom edibility, these results also show that a meaningful share of the
small number of features formed by DUALTREE is chosen by C4.5's heuristics.
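
The usage percentages reported for Table 6.8 follow from simple ratios. The sketch below is illustrative only: the names are hypothetical, and it assumes that the edge percentage is taken over the initial (unpruned) tree, whose node count comes from Table 6.7. That assumption is inferred from the reported values. It relies on the fact that a tree with n nodes has n - 1 edges.

```python
# name: (features formed, features used, edges with a feature, initial tree size)
# Counts are copied from Tables 6.7 and 6.8.
RESULTS = {
    "fbmonks1":    (2, 1, 3, 5),
    "fbmonks2":    (5, 4, 8, 111),
    "fbmonks3":    (2, 1, 3, 7),
    "fbtictactoe": (29, 12, 32, 143),
}

def usage_percentages(formed, used, edges_with_feature, tree_size):
    """Return (% of formed features used, % of tree edges touching a feature)."""
    pct_used = 100.0 * used / formed
    pct_edges = 100.0 * edges_with_feature / (tree_size - 1)  # n nodes -> n - 1 edges
    return round(pct_used, 1), round(pct_edges, 1)

if __name__ == "__main__":
    for name, row in RESULTS.items():
        print(name, usage_percentages(*row))
```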


6.4   Feature Construction With Continuous Data

We concluded our testing of DUALTREE's capabilities using two additional
data-sets from the repository. The first was the 'Credit Approval' database (crx). This
database has nine nominally valued attributes and six continuously valued attributes. For
the second, the 'Glass Identification Database' (glass), all of the attributes are
continuously valued. We next give a brief description of the two data-sets.

The  'Credit  Approval',  or  crx,  data-set  concerns  credit  card  applications.  There 
are  690  instances  and  we  want  to  classify  each  example  as  '+'  or  '-',  using  nine  nominal 
and  six  continuous  attributes.  Five  percent  of  the  instances  have  one  or  more  missing 
values  for  some  of  the  attributes.  For  the  data-set,  44.5%  of  the  instances  are  classified 
as  '+'. 

The  'Glass  Identification  Database',  or  glass,  represents  a  study  of  classification 

of  glass  motivated  by  criminological  investigation.    The  task  is  to  correctly  identify  the 

glass  left  at  the  scene  of  a  crime.  For  the  data-set,  there  are  214  instances;  9  continuously 

valued  attributes  plus  an  Id#  attribute.  We  also  have  the  following  from  Murphy  and  Aha 

(1992): 

Type of glass: (class attribute)
    -- 1 building_windows_float_processed
    -- 2 building_windows_non_float_processed
    -- 3 vehicle_windows_float_processed
    -- 4 vehicle_windows_non_float_processed (none in this database)
    -- 5 containers
    -- 6 tableware
    -- 7 headlamps

Summary Statistics:

    Attribute:    Min       Max       Mean      SD       Corr
    RI:           1.5112    1.5339    1.5184    0.0030   -0.1642
    Na:          10.73     17.38     13.4079    0.8166    0.5030
    Mg:           0         4.49      2.6845    1.4424   -0.7447
    Al:           0.29      3.5       1.4449    0.4993    0.5988
    Si:          69.81     75.41     72.6509    0.7745    0.1515
    K:            0         6.21      0.4971    0.6522   -0.0100
    Ca:           5.43     16.19      8.9570    1.4232    0.0007
    Ba:           0         3.15      0.1750    0.4972    0.5751
    Fe:           0         0.51      0.0570    0.0974   -0.1879

Class Distribution: (out of 214 total instances)
    -- 163 Window glass (building windows and vehicle windows)
       -- 87 float processed
          -- 70 building windows
          -- 17 vehicle windows
       -- 76 non-float processed
          -- 76 building windows
          -- 0 vehicle windows
    -- 51 Non-window glass
       -- 13 containers
       -- 9 tableware
       -- 29 headlamps



6.4.1    Results  using  Continuous  Data 

When DUALTREE forms features using continuously valued attributes, the size
of the feature may become larger than the total number of primitive attributes. This is
due to our representation of features for continuously valued primitives. Suppose 'x' is
a continuously valued attribute found in a decision tree, and the tree uses 'x' in tests
involving three distinct cutoff values: 6, 3.3, and 9. Then 'x <= 6', 'x <= 3.3', and 'x <= 9'
are all features of size one, formed using the primitive 'x'. Suppose DUALTREE forms
a new feature using these three features. DUALTREE, which forms features
indiscriminately, regards this feature as having a feature-size of three. The conjunct
comprising the feature (i.e., 'x<=6 · x<=3.3 · x<=9') can be replaced by the 'primitive-test'
involving the lowest cutoff value, or x<=3.3. Recall also that DUALTREE does not
'replace' features having several tests for the same primitive. Instead, the algorithm selects
only those features having at most one test for a continuously valued attribute. This
'selection bias' constrains the sizes of features given by DUALTREE such that no feature
has a size larger than the total number of nominal and continuous primitives found in a
tree. Given a binary decision tree, the total number of possible features we can form
depends on the total number of primitives used in the tree and the number of times each
continuous primitive appears in a different test. The maximum size for any DUALTREE
feature, using the given tree, is no larger than the total number of nominal and continuous
primitives appearing in the tree.
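
The two ideas in the preceding paragraph, collapsing repeated '<=' tests on one primitive and filtering out features that repeat a primitive, can be sketched as follows. This is an illustration only and not DUALTREE's implementation: the representation of a feature as a list of (attribute, cutoff) pairs and both function names are assumptions made for the example.

```python
def simplify_conjunct(tests):
    """Collapse a conjunct of 'attribute <= cutoff' tests to one test per attribute.

    A conjunction of several '<=' tests on the same attribute is equivalent to
    the single test with the lowest cutoff.
    """
    lowest = {}
    for attribute, cutoff in tests:
        if attribute not in lowest or cutoff < lowest[attribute]:
            lowest[attribute] = cutoff
    return lowest

def has_repeated_attribute(tests):
    """True if some attribute is tested more than once in the conjunct."""
    attributes = [attribute for attribute, _ in tests]
    return len(attributes) != len(set(attributes))

if __name__ == "__main__":
    feature = [("x", 6), ("x", 3.3), ("x", 9)]
    print(simplify_conjunct(feature))        # the equivalent single test on x
    print(has_repeated_attribute(feature))   # such a feature would be filtered out
```

Under the selection bias described above, only features for which `has_repeated_attribute` is false would be retained, so the simplification step is never needed in practice.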

Table  6.9  displays  our  performance  results  acquired  using  crx  and  glass.  As  an 
example  of  our  previous  discussion,  observe  that  the  number  of  primes  used  for  fbcrx  is 
larger  than  the  total  number  of  attributes  used  for  constructing  bcrx.  Since  the  number 
of  attributes  used  for  constructing  a  tree  is  never  less  than  the  number  of  these  attributes 
found  in  the  tree,  in  our  work,  the  maximum  number  of  features  we  may  form  depends 
on  the  number  of  primitives  used  to  build  the  tree.  For  example,  considering  fbglass, 
DUALTREE formed 20/2^9, or 3.91%, of the maximum number of features.

Essentially, the results in Table 6.9 are comparable to the results from the previous
tests. This concludes our testing phase for DUALTREE. In the following section, we
discuss results for using the BEBR sample data and examining DUALTREE's features
to see if they represent plausible descriptions of consumer life-styles and expectations.

Table 6.9   Continuous and nominal data results

                            Before Pruning              After Pruning
    fn       # atts  # obs  size  Errors     rank    size  Errors     Est     rank
    crx        15     690    147  31(4.5%)             40  60(8.7%)   13.7%
    bcrx       47            117  20(2.9%)    3        61  38(5.5%)   11.7%    2
    fbcrx      84            117  20(2.9%)    3        61  40(5.8%)   11.8%    2
    glass       9     214     67  12(5.6%)             57  14(6.5%)   22.7%
    bglass      9             67  12(5.6%)    3        57  14(6.5%)   22.7%    3
    fbglass    29             57* 14(6.5%)    3        55* 14(6.5%)   22.1%*   3

    'f'<func-name>  =>  C4.5 using features formed with Initial Tree
    'b'<func-name>  =>  C4.5 uses 'Binarized' files
    BOLDFACE*       =>  'Improvement' in performance

              # primes  # features  # features  max size  max size  # edges
    fn        used      formed      used        formed    used      with feature
    fbcrx        52         31          15          ?         ?          ?
    fbglass      32         20          13          ?         ?          ?



6.5    DUALTREE  Descriptions  of  Consumer  Life-Styles 


The  previous  sections  showed  that,  in  general,  we  have  developed  a  practical 
reasoning  system  for  creating  a  certain  kind  of  knowledge  base.  Now  we  are  interested 
in  using  the  information  given  by  our  tool  for  various  purposes.  For  this  set  of 
experiments  we  used  the  BEBR  surveys  of  consumer  attitudes  and  expectations.  Chapter 
2  described  our  experimental  design  using  the  survey  data.  Recall  that  for  testing  the  first 


hypothesis proposed in Chapter 1, we used consumers' demographic attributes as
independent variables and the dependent variable represented consumers' expectations of
their future financial conditions. For the second hypothesis proposed in Chapter 1, we
examined consumers' purchase plans in terms of their expectations regarding certain
events. In the following sections, first we briefly describe our empirical design and
method. Next we consider our experimental results using the BEBR survey data.

6.5.1    Empirical  Design  and  Method 

Previously,  we  focused  on  performance  in  terms  of  the  size,  misclassification  rate, 
rank,  and  the  estimate  of  a  tree.  Recall  that  the  'estimate'  is  C4.5's  prediction  of  how 
well  a  tree  will  classify  unseen  cases.  Recall  also  from  Chapter  4  that  an  MDLP 
approach  to  constructing  decision  trees  involves,  for  example,  building  a  tree  with  the 
smallest  possible  error  rate  in  classifying  unseen  objects.  In  this  section,  we  extend  our 
analysis  of  the  descriptive  performance  of  DUALTREE's  features  by  examining 
distributions  of  the  class  assignments  using  unseen  cases.  This  requires  that  we  use  a  test 
set,  or  holdout  sample,  of  cases  which  are  not  used  by  C4.5  to  build  a  decision  tree.  C4.5 
also  gives  a  confusion  matrix  for  a  simplified  tree  using  a  test  set.  A  confusion  matrix 
shows how a collection of classified items is distributed among the classes. Of the two
trees given by C4.5, namely the 'Initial' and 'Simplified' trees, the 'Simplified' tree
typically  classified  unseen  items  of  our  data  more  accurately  than  the  'Initial'  tree. 
Hence,  in  the  following  discussion  we  list  confusion  matrices  using  the  'Simplified'  tree. 


Table 6.10  Training and test set class-distributions

TRAINING SET -- 1990 - 1991   (5100 observations)

    Reply          Frequency    Percent
    Better            2040        40.0
    Same              2069        40.6
    Worse              635        12.5
    Don't Know         356         7.0

TEST SET -- 1990 - 1991   (1276 observations)

    Reply          Frequency    Percent
    Better             498        39.0
    Same               539        42.2
    Worse              147        11.5
    Don't Know          92         7.2

The 'TIME' attribute in Appendix A shows that we used five consecutive quarters
of survey data: the last three quarters of 1990 and the first two quarters of 1991. For this
time period there are 6376 observations in the data set. We arbitrarily chose to reserve
20% of the cases to form a test set; the remaining cases are used to build a tree.
Table 6.10 shows the distributions of respondents' replies to our future-financial-question
for both the training and test sets. We observe that the replies are distributed in
practically the same proportions for both sets. Note also that the combined observations
for the 'worse' and 'don't know' answers account for less than 20% of the total in each
respective set. Since the proportions are about even for the other two sets of answers,
this suggests that C4.5's heuristics will give rules that tend to assign items to the
BETTER or SAME class. This is primarily because of the numbers of observations in
each category. Also for these categories, when we combine the unsure replies with one
of them, almost one half of the cases will be members of this category. The following
sections describe our results using a training and test set for each hypothesis presented
in Chapter 1.

6.5.2   Demographic  Descriptions  of  Financial  Conditions 

We  proposed  in  Chapter  1  that  'unique'  descriptions  exist  for  the  'don't  know' 
category,  and,  we  can  better  study  the  recessionary  period  by  keeping  the  four  alternative 
responses  mutually  exclusive.  To  test  this  hypothesis  we  created  four  training/test  sets. 
In  the  first  set  we  kept  the  four  alternative  answers  mutually  exclusive  when  making  class 
assignments  to  the  observations.  In  the  remaining  three  we  assigned  the  unsure 
respondents  to  one  of  the  remaining  three  categories  (i.e.,  'better',  'same',  or  'worse'). 
For  example,  for  the  second  set,  we  assigned  the  'don't  know'  respondents  to  the  'better' 
category.  We  assigned  the  unsure  respondents  to  the  'worse'  category  in  the  fourth  data 
set. The 'fn's for each set, respectively, are: (1) bswdk, (2) (bdk)sw, (3) b(sdk)w, and, (4)
bs(wdk).  If  our  hypothesis  is  true  then  the  features  and  the  trees  created  using  bswdk 
should  exhibit  the  best  classification  performance  especially  using  the  unseen  cases.  We 
found  evidence,  though  faint,  to  the  contrary,  however. 
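
The four class-assignment schemes can be illustrated using the training-set counts from Table 6.10. The sketch below is hypothetical: the function names and the short category keys are our own, and it operates on category counts rather than on the survey records themselves. It shows how merging the 'don't know' respondents into another category changes the class distribution.

```python
# Training-set reply counts from Table 6.10 ("dk" stands for 'don't know').
TRAIN = {"better": 2040, "same": 2069, "worse": 635, "dk": 356}

def merge_dont_know(counts, target=None):
    """Return class counts with 'dk' folded into `target` (or kept separate if None)."""
    merged = {k: v for k, v in counts.items() if k != "dk" or target is None}
    if target is not None:
        merged[target] += counts["dk"]
    return merged

def percentages(counts):
    """Class distribution in percent, rounded to one decimal place."""
    total = sum(counts.values())
    return {k: round(100.0 * v / total, 1) for k, v in counts.items()}

if __name__ == "__main__":
    schemes = {"bswdk": None, "(bdk)sw": "better", "b(sdk)w": "same", "bs(wdk)": "worse"}
    for fn, target in schemes.items():
        print(fn, percentages(merge_dont_know(TRAIN, target)))
```

For the (bdk)sw and b(sdk)w schemes the merged class holds close to one half of the training cases, which is the imbalance noted in the previous section.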

Table  6.11  shows  performance  results  obtained  using  our  four  data  sets. 
Considering  the  misclassification  rates,  we  acquired  the  best  overall  performance  when 
we combined the unsure respondents with those answering 'the same'. The 'Initial' tree
created using C4.5 and DUALTREE's features misclassified 24.3% of the 5100
observations. Keeping the categories of answers mutually exclusive gave a tree having
the highest misclassification rate of the four, 28.4%, although this may not be a
comparable result since it is fundamentally harder to classify objects into four classes as
opposed to three. Thus, it appears that we must examine other aspects of the trees in
order to make a fair judgement of their performance. The misclassification rates are
fairly close for the other two trees (i.e., 27.0% for fb(bdk)sw and 27.6% for fbbs(wdk)).
DUALTREE formed 168 'useful' features using the tree represented by bb(sdk)w and
37.5% of them are used in the tree represented by fbb(sdk)w. Also, 28.6% of the 2196
edges in the tree have a new feature as a predecessor or successor.

Table 6.11  C4.5 results using BEBR data-sets

                                Before Pruning                 After Pruning
    fn          # atts  # obs   size  Errors        rank    size  Errors         Est      rank
    bswdk         12    5100    3037  1525(29.9%)             51  2473(48.5%)    50.7%
    bbswdk        48            2187  1473(28.9%)    6       543  2002(39.3%)    48.3%     5
    fbbswdk      236            2317  1448(28.4%)    5*      509* 1998(39.2%)*   48.2%*    4*
    (bdk)sw       12    5100    2940  1350(26.5%)            137  2227(43.7%)    46.9%
    b(bdk)sw      48            2185  1360(26.7%)    6       425  1906(37.4%)    45.2%     4
    fb(bdk)sw    121            2295  1376(27.0%)    6       385* 1949(38.2%)    44.9%     4
    b(sdk)w       12    5100    2975  1291(25.3%)            148  2110(41.4%)    46.5%
    bb(sdk)w      48            2157  1275(25.0%)    6       497  1758(34.5%)    43.9%     5
    fbb(sdk)w    216            2197  1240(24.3%)*   5*      521  1755(34.4%)*   43.9%     5
    bs(wdk)       12    5100    3075  1443(28.3%)            194  2334(45.8%)    50.3%
    bbs(wdk)      48            2261  1376(27.0%)    6       475  1976(38.7%)    47.7%     5
    fbbs(wdk)    131            2305  1410(27.6%)    5*      403* 2074(40.7%)    48.0%     4*

    'f'<func-name>  =>  C4.5 using features formed with Initial Tree
    'b'<func-name>  =>  C4.5 uses 'Binarized' files
    BOLDFACE*       =>  'Improvement' in performance

Table  6.12  shows  confusion  matrices  using  test  cases  for  our  four  data  sets.  To 
interpret  each  matrix,  the  sum  of  the  numbers  in  each  row  equals  the  respective 
frequency-counts  given  in  Table  6.10  for  the  test  set.  Each  row  shows  how  the  class- 
assignments  were  distributed  among  the  classes  using  a  'Simplified'  tree.  Observe  that 
practically  all  of  the  cases  for  the  'worse'  category  are  misclassified.  Also,  the  tree 
represented  by  fbbswdk  assigned  none  of  the  92  cases  for  the  'don't  know'  category  to 
the  DON'T  KNOW  class. 

Recall  we  suggested  that,  when  comparing  two  trees  that  describe  the  same  sample 
using  a  different  number  of  classes,  we  require  further  examinations  of  parameters  for  the 
trees.  For  example,  suppose  we  also  examine  the  number  of  objects  in  a  category 
correctly  assigned  to  the  class  corresponding  to  the  category.  The  higher  the  number  the 
more  accurate  the  assignment  rules  for  the  class.  Considering  this,  we  can  say,  for 
example,  that  the  assignment-rules  associated  with  the  DON'T  KNOW  class  in  the  tree 
represented  by  fbbswdk  are  not  very  accurate  since  none  of  the  nine  objects  assigned  to 
this  class  belong  to  the  'don't  know'  category.  This  is  an  unfavorable  result  regarding 
this  tree's  performance. 

From  Table  6.12  we  see  that  the  tree  represented  by  fb(bdk)sw  classified  the  most 
items  correctly.  This  tree  misclassified  23%  and  62%  of  the  items  in  the  'better'  and 
'same'  categories  respectively.  The  tree  represented  by  fbb(sdk)w  correctly  classified  the 
second highest number of items and misclassified 54% and 36% of cases in the 'better'
and 'same' categories respectively. The tree represented by fbbswdk correctly classified
the lowest number of items.

Table 6.12  Confusion matrices for the four data sets

Confusion Matrices using C4.5 Simplified Trees

    fbbswdk
        (a)   (b)   (c)   (d)    <==classified as
        295   190    11     2    (a): class BETTER
        220   295    19     5    (b): class SAME
         58    82     5     2    (c): class WORSE
         35    52     5     0    (d): class DON'T KNOW
    46.6% of 1276 items correctly classified

    fbb(sdk)w
        (a)   (b)   (c)    <==classified as
        227   253    18    (a): class BETTER
        204   407    20    (b): class SAME
         32   113     2    (c): class WORSE
    50.0% of 1276 items correctly classified

    fb(bdk)sw
        (a)   (b)   (c)    <==classified as
        454   125    11    (a): class BETTER
        316   207    16    (b): class SAME
         86    58     3    (c): class WORSE
    52.0% of 1276 items correctly classified

    fbbs(wdk)
        (a)   (b)   (c)    <==classified as
        352   120    26    (a): class BETTER
        268   235    36    (b): class SAME
        105   116    18    (c): class WORSE
    47.4% of 1276 items correctly classified
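
The quantities read off the confusion matrices in this section (overall accuracy, the misclassification rate within each category, and the number of items assigned to a class) can be computed as in the sketch below. The function names are hypothetical. The two matrices are copied from Table 6.12, with rows as true categories and columns as assigned classes.

```python
def accuracy(matrix):
    """Fraction of items on the diagonal, i.e. assigned to their own class."""
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    total = sum(sum(row) for row in matrix)
    return correct / total

def row_error_rates(matrix):
    """For each true category (row), the fraction of its items misclassified."""
    return [1 - row[i] / sum(row) for i, row in enumerate(matrix)]

def column_totals(matrix):
    """For each class (column), the number of items assigned to it."""
    return [sum(column) for column in zip(*matrix)]

# Rows: true category; columns: class assigned by the 'Simplified' tree.
FB_BDK_SW = [[454, 125, 11], [316, 207, 16], [86, 58, 3]]
FBBSWDK = [[295, 190, 11, 2], [220, 295, 19, 5], [58, 82, 5, 2], [35, 52, 5, 0]]

if __name__ == "__main__":
    print(round(100 * accuracy(FB_BDK_SW), 1))
    print([round(100 * r) for r in row_error_rates(FB_BDK_SW)])
    print(column_totals(FBBSWDK))
```

The last column total for fbbswdk gives the nine items assigned to the DON'T KNOW class, none of which lie on the diagonal.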

At this point it appears that fb(bdk)sw may also represent a plausible hypothesis
for the target concept. Our task now is to determine which of the two representations
(i.e., fbb(sdk)w and fb(bdk)sw) is the more feasible one. When we combine the unsure
respondents with respondents answering 'the same', we are implying that, for our time
period, demographic descriptions of unsure respondents more closely resemble
descriptions of respondents answering 'the same' than those for either of the remaining
two alternative replies. Similarly, fb(bdk)sw suggests that demographic descriptions for
'better-off' and unsure respondents are alike during our time period. To determine which
of the previous statements is more correct, we considered DUALTREE and the features
formed by the algorithm.

Recall that DUALTREE is based on formal theories and creates a virtual tree for
each class. All paths in the virtual tree for a class lead either to a null node or to a
leaf for the class. Also recall Wegener's (1987) result that it is sufficient to specify
f^{-1}(1) or f^{-1}(0), where f(x)=1 or f(x)=0, if f is an element of the set of Boolean
functions. This information suggests that each virtual tree or hypothesis is sufficient for
describing a class, since the paths leading to the leaves in the tree represent f^{-1}(1)
if a '1' implies class membership. Also, if a virtual tree contains features that are not
found in any of the other virtual trees, then it must be a different virtual tree from the
others since it contains 'unique' features for the class. We used this information in the
following manner. For a given fn, we examined each virtual tree and formed a list of the
'unique' features for each class. Next, we removed from the list every feature that was
not used in the decision tree constructed by C4.5 using DUALTREE's features. Finally,
using the original data-set of 6376 observations and our list of 'unique' features, we
performed certain analyses and interpreted those results.
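
The selection procedure just described can be sketched in a few lines of Python. This is only an illustrative sketch, not DUALTREE's implementation: it assumes each virtual tree has already been reduced to a set of features, that a feature is a conjunction of attribute=value tests, and the names `unique_features`, `keep_used`, and `feat`, as well as the toy features themselves, are hypothetical.

```python
from typing import Dict, FrozenSet, Set

# Assumption: a feature is a conjunction of "ATTRIBUTE=value" tests.
Feature = FrozenSet[str]


def feat(*tests: str) -> Feature:
    """Build a feature from its attribute=value tests (order is irrelevant)."""
    return frozenset(tests)


def unique_features(virtual_trees: Dict[str, Set[Feature]]) -> Dict[str, Set[Feature]]:
    """For each class, keep the features that appear in no other class's virtual tree."""
    unique: Dict[str, Set[Feature]] = {}
    for cls, feats in virtual_trees.items():
        others: Set[Feature] = set()
        for other_cls, other_feats in virtual_trees.items():
            if other_cls != cls:
                others |= other_feats
        unique[cls] = feats - others
    return unique


def keep_used(unique: Dict[str, Set[Feature]], used_by_c45: Set[Feature]) -> Dict[str, Set[Feature]]:
    """Drop every unique feature that the C4.5 tree did not use."""
    return {cls: feats & used_by_c45 for cls, feats in unique.items()}


# Hypothetical toy input: three virtual trees sharing one common feature.
shared = feat("AGE=45-64")
virtual_trees = {
    "BETTER": {shared, feat("HOSIZE=1", "SEX=Female")},
    "SAME": {shared, feat("MSA=Yes", "RACE=Other", "SEX=Female")},
    "WORSE": {shared},
}
used_by_c45 = {shared, feat("MSA=Yes", "RACE=Other", "SEX=Female")}

unique = unique_features(virtual_trees)
final = keep_used(unique, used_by_c45)
```

In this toy input the WORSE tree has no unique feature, and the BETTER tree's unique feature is dropped because the C4.5 tree never uses it.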

Recall from Chapter 4 that we suggested three properties or characteristics of
useful features. We examined our unique features to determine if they possessed two of
these characteristics and, if so, to what extent. The characteristics we examine are
the following: (1) does a unique feature correspond to a subspace of lower dimensionality
than the original n-dimensional feature space?, and (2) does a unique feature partition a
sample such that at least one of the partitions contains a predominance of examples
belonging to a single class? When studying class membership, ideally, the predominance
of examples should belong to the class associated with the virtual tree that the unique
feature appears in. Recall that a virtual tree for a class classifies an object either as a
member of the class or as not a member of the class. Hence, a feature is just as useful if
the predominance of examples belongs to a different class or classes.

Finally,  we  examined  the  descriptive  power  of  our  unique  features  with  respect  to 
consumers'  buying  plans  for  certain  items  and  the  recession.  If  our  unique  features 
represent  'typical'  demographic  descriptions  of  consumers'  financial  expectations  during 
the  recessionary  period,  we  should  find  a  relatively  equal  proportion  of  these  types  of 
consumers  appearing  in  each  time  period.  As  an  additional  illustration  of  examining 
DUALTREE's  features,  recall  from  Chapter  1  that  a  sudden  increase  in  the  number  of 
unsure  respondents  to  our  future-financial-question  occurred  during  the  recessionary 
period.  Now  suppose  that,  after  reviewing  the  time-period  information,  we  determine  that 
our  unique  features  describe  'typical'  consumers.  If  our  'typical'  consumers  are 
associated  with  the  phenomenon  regarding  the  surge  in  unsure  replies,  then  we  should 
find  changes  in  their  responses  which  coincide  with  the  phenomenon.  The  changes  in 
their  replies  should  be  even  more  apparent  if  this  is  a  case  where  the  unsure  respondents 
have  been  combined  with,  for  example,  'better',  'same',  or,  'worse'. 

Considering consumers' buying plans, we examined their responses to two
questions on the survey: (1) do you think now is a good or a bad time for people to buy
major household items?--GBTIME, and (2) does anyone in the household plan to buy a
car or truck?--PLANBC. This, in effect, represents independent information which is not
affected by our action of combining the unsure respondents with one of the other three
categories. Here we simply check our unique features to see if they discriminate between
consumers having definite views on purchasing certain items and those that do not. The
next two sections describe our examination of DUALTREE's 'unique' features for
fbb(sdk)w and fb(bdk)sw.

6.5.2.1    Descriptions  of  'the  same'  and  unsure  consumers 

Using DUALTREE's three virtual trees for fbb(sdk)w (i.e., a tree for BETTER,
SAME, and WORSE), we identified seventeen unique features belonging to class
'SAME'. Only four of these unique features were used by C4.5. The 'BETTER' class
possessed two unique features; however, neither of them appeared in the tree produced
by C4.5. We found no unique features for the 'WORSE' class. Our list of four unique
features for the 'SAME' class included the following:

1.  (EDUCAT = College && TIME = 91Q2)

2.  (MSA = Yes && RACE = Other && SEX = Female)

3.  (EDUCAT = HighSchoolGrad && HOSIZE = 3 &&
INCOME = $45K-$75K)

4.  (EDUCAT = CollegeGrad && MARRY = Widowed)

Table  6.13  shows  certain  results  for  the  number  of  consumers  in  the  original  data-set  who 
conform  to  the  descriptions  represented  by  each  feature. 

Table 6.13  Features in a virtual tree for the SAME class

Unique Features for FBB(SDK)W

                             #              feature    FUTFIN
feature                      obs     %      edges      better       same         worse

EDUCAT=College,              353     5.5    17         161 (46%)    155 (44%)    37 (10%)
TIME=91Q2.

MSA=Yes,                     33      0.5    3          9 (27%)      23 (70%)     1 (3%)
RACE=Other,
SEX=Female.

EDUCAT=HighSchoolGrad,       48      0.8    10         22 (46%)     23 (48%)     3 (6%)
HOSIZE=3,
INCOME=$45K-$75K.

EDUCAT=CollegeGraduate,      65      1.1    6          10 (15%)     41 (63%)     14 (22%)
MARRY=Widowed.

We see from the table that our features or descriptions account for 5.5%, 0.5%,
0.8%, and 1.1% of the 6376 observations, respectively. Table 6.13 also shows the number
of edges in the tree produced by C4.5 that use each feature, along with distributions of
consumers' replies to our future-financial-question. Considering the second group of
consumers, 70% of them belong to the 'same' category. Hence, this feature does an
adequate job of partitioning the sample so that a predominance of the examples belongs
to a single category. Also for this case, the dominating category is identical to the class
associated with the virtual tree it appears in. On the other hand, considering the first
group of consumers, 44% of them belong to the 'same' category and 46% are in the
'better' category. However, we can say that, for the most part, this feature typically
describes consumers in one of two categories, which effectively eliminates the third
category from consideration.
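
The predominance check applied to each feature can be illustrated with a short Python sketch. The reply counts below are those reported in Table 6.13 for the second, third, and fourth features; the function names and the 60% cut-off are assumptions made only for this illustration, since the text does not state a numeric threshold for 'predominance'.

```python
from typing import Dict, Optional


def class_shares(counts: Dict[str, int]) -> Dict[str, float]:
    """Convert per-class reply counts into proportions of the partition."""
    total = sum(counts.values())
    return {cls: n / total for cls, n in counts.items()}


def predominant_class(counts: Dict[str, int], threshold: float = 0.6) -> Optional[str]:
    """Return the dominating class, or None if no class reaches the threshold.

    The 0.6 default is a hypothetical cut-off, not a value taken from the text.
    """
    shares = class_shares(counts)
    best = max(shares, key=shares.get)
    return best if shares[best] >= threshold else None


# Reply counts (better, same, worse) as reported in Table 6.13.
feature_2 = {"better": 9, "same": 23, "worse": 1}    # MSA=Yes, RACE=Other, SEX=Female
feature_3 = {"better": 22, "same": 23, "worse": 3}   # EDUCAT=HighSchoolGrad, HOSIZE=3, INCOME=$45K-$75K
feature_4 = {"better": 10, "same": 41, "worse": 14}  # EDUCAT=CollegeGraduate, MARRY=Widowed

dominant_2 = predominant_class(feature_2)
dominant_3 = predominant_class(feature_3)
dominant_4 = predominant_class(feature_4)
```

Under this assumed threshold the second and fourth features are dominated by 'same', while the third feature splits almost evenly between 'better' and 'same' and so has no predominant class.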

Table 6.14 shows distributions of respondents' replies with respect to their buying
plans during the recessionary period. We see from the table that the second feature
discriminates well between consumers having definite buying plans (i.e., good and bad,
or, yes and no) and those with indefinite ones (i.e., maybe and don't know). Consumers
fitting this description (i.e., female consumers of race 'other' living in a metropolitan
statistical area) appear in the following percentages: 24.2%, 30.3%, 12.1%, and 33.3%,
for time periods 90Q2 through 91Q1 respectively. On the other hand, widowed college
graduates tend to have definite as well as indefinite views on whether or not it is a good
time to buy things, regardless of their future-financial expectations. The percentages for
this group of consumers are 16.9%, 16.9%, 21.5%, 24.6%, and 20.0% for periods 90Q2
through 91Q2 respectively.

In general, these unique features appear to represent plausible demographic
descriptions for consumers during our time period when compared to other types of
descriptions like, for example, single females over the age of 45 earning less than $20,000
a year who also thought that they would be financially better off a year following the first
quarter of the recession. Along with our previous results, we appear to have faint
evidence for accepting the tree represented by fbb(sdk)w as a reasonable description of
the data. Next, we discuss the unique features appearing in the tree represented by
fb(bdk)sw.


Table 6.14  Demographic descriptions of consumer buying plans


EDUCAT=College && TIME=91Q2

                       GBTIME
FUTFIN    Good         Uncert.      Bad          Don't Know
Better    87 (54%)     17 (11%)     51 (32%)     6 (3%)
Same      71 (46%)     30 (19%)     48 (31%)     6 (4%)
Worse     16 (43%)     6 (16%)      15 (41%)     0 (0%)

                       PLANBC
FUTFIN    Yes          Maybe        No           Don't Know
Better    30 (19%)     11 (7%)      119 (74%)    1 (<1%)
Same      17 (11%)     8 (5%)       130 (84%)    0 (0%)
Worse     4 (11%)      0 (0%)       33 (89%)     0 (0%)

                       TIME
FUTFIN    90Q2         90Q3         90Q4         91Q1         91Q2
Better    0 (0%)       0 (0%)       0 (0%)       0 (0%)       161 (100%)
Same      0 (0%)       0 (0%)       0 (0%)       0 (0%)       155 (100%)
Worse     0 (0%)       0 (0%)       0 (0%)       0 (0%)       37 (100%)


MSA=Yes && RACE=Other && SEX=Female

                       GBTIME
FUTFIN    Good         Uncert.      Bad          Don't Know
Better    5 (56%)      0 (0%)       4 (44%)      0 (0%)
Same      11 (48%)     0 (0%)       12 (52%)     0 (0%)
Worse     1 (100%)     0 (0%)       0 (0%)       0 (0%)

                       PLANBC
FUTFIN    Yes          Maybe        No           Don't Know
Better    1 (11%)      0 (0%)       8 (89%)      0 (0%)
Same      6 (26%)      0 (0%)       17 (74%)     0 (0%)
Worse     0 (0%)       0 (0%)       1 (100%)     0 (0%)

                       TIME
FUTFIN    90Q2         90Q3         90Q4         91Q1         91Q2
Better    3 (33%)      4 (45%)      0 (0%)       2 (22%)      0 (0%)
Same      5 (22%)      5 (22%)      4 (17%)      9 (39%)      0 (0%)
Worse     0 (0%)       1 (100%)     0 (0%)       0 (0%)       0 (0%)


EDUCAT=HighSchoolGrad && HOSIZE=3 && INCOME=$45K-$75K

                       GBTIME
FUTFIN    Good         Uncert.      Bad          Don't Know
Better    12 (55%)     0 (0%)       8 (36%)      2 (9%)
Same      12 (52%)     2 (9%)       7 (30%)      2 (9%)
Worse     2 (67%)      0 (0%)       1 (33%)      0 (0%)

                       PLANBC
FUTFIN    Yes          Maybe        No           Don't Know
Better    4 (18%)      0 (0%)       18 (83%)     0 (0%)
Same      3 (13%)      3 (13%)      17 (74%)     0 (0%)
Worse     1 (33%)      1 (33%)      1 (34%)      0 (0%)

                       TIME
FUTFIN    90Q2         90Q3         90Q4         91Q1         91Q2
Better    2 (9%)       5 (23%)      8 (36%)      4 (18%)      3 (14%)
Same      2 (9%)       4 (17%)      6 (26%)      8 (35%)      3 (13%)
Worse     0 (0%)       1 (33%)      0 (0%)       1 (33%)      1 (34%)


EDUCAT=CollegeGrad && MARRY=Widowed

                       GBTIME
FUTFIN    Good         Uncert.      Bad          Don't Know
Better    6 (60%)      2 (20%)      2 (20%)      0 (0%)
Same      17 (42%)     8 (19%)      11 (27%)     5 (12%)
Worse     6 (43%)      1 (7%)       6 (43%)      1 (7%)

                       PLANBC
FUTFIN    Yes          Maybe        No           Don't Know
Better    2 (20%)      0 (0%)       8 (80%)      0 (0%)
Same      5 (12%)      1 (3%)       35 (85%)     0 (0%)
Worse     0 (0%)       0 (0%)       14 (100%)    0 (0%)

                       TIME
FUTFIN    90Q2         90Q3         90Q4         91Q1         91Q2
Better    1 (10%)      1 (10%)      2 (20%)      6 (60%)      0 (0%)
Same      7 (17%)      6 (15%)      10 (24%)     7 (17%)      11 (27%)
Worse     3 (21%)      4 (29%)      2 (14%)      3 (22%)      2 (14%)

Using the virtual trees for fb(bdk)sw, we found five unique features appearing in
the tree for class BETTER; however, only one of them appeared in the C4.5 tree. The
SAME class possessed three unique features and all of them appeared in the tree
constructed by C4.5. We found no unique features appearing in the virtual tree for the
WORSE class.

Table 6.15 shows distributions of respondents' replies for the unique feature
belonging to the BETTER class. This group represents, for example, female consumers
living alone whose annual income is less than $25,000. The percentage of these 408
consumers in each of the time periods is 15.2%, 15.4%, 15.7%, 35.8%, and 17.9%, for
90Q2 through 91Q2 respectively. Table 6.15 also shows that this feature partitions the
sample, when the unsure respondents are combined with 'better' respondents, such that
67% of this group of consumers did not reply 'better off' (i.e., are not members of the
category). This feature does not discriminate well among consumers having definite and
indefinite views on purchasing major household items, and it also appears that this group
of consumers typically does not plan to purchase a car. Considering the time distribution,
observe that the increased percentage during 91Q1 was rather evenly distributed among
the alternative replies to our question. Recall that this combination of responses suggests
that demographic descriptions of unsure respondents more closely resemble descriptions
of consumers who were 'better off' during the recessionary episode. Hence, it appears
that, for example, during the last of three periods of negative growth for the U.S.
economy, many single females earning less than $25,000 a year felt that they would be
financially better off in the first quarter of 1992. Also, only 37% of them thought that
it was a good time to purchase major household items. These results suggest that this
feature may not represent a plausible description for consumer life-styles during the recent
recession.


Table 6.15  Descriptions of 'better' and unsure respondents

Unique Features for FB(BDK)SW: Class=BETTER

                             #              feature    FUTFIN
feature                      obs     %      edges      better       same         worse

HOSIZE=1,                    408     6.4    33         134 (33%)    209 (51%)    65 (16%)
INCOME=<$25K,
SEX=Female.


HOSIZE=1 && INCOME=<$25K && SEX=Female

                       GBTIME
FUTFIN    Good         Uncert.      Bad          Don't Know
Better    50 (37%)     25 (19%)     41 (31%)     18 (13%)
Same      (45%)        (18%)        (29%)        (8%)
Worse     21 (32%)     11 (17%)     26 (40%)     7 (11%)

                       PLANBC
FUTFIN    Yes          Maybe        No           Don't Know
Better    6 (5%)       7 (5%)       120 (90%)    1 (<1%)
Same      8 (4%)       8 (4%)       193 (92%)    0 (0%)
Worse     0 (0%)       1 (2%)       62 (95%)     2 (3%)

                       TIME
FUTFIN    90Q2         90Q3         90Q4         91Q1         91Q2
Better    19 (14%)     19 (14%)     22 (16%)     56 (42%)     18 (14%)
Same      35 (17%)     34 (16%)     31 (15%)     65 (31%)     44 (21%)
Worse     8 (12%)      10 (15%)     11 (17%)     25 (39%)     11 (17%)

Table 6.16 shows information for three unique features appearing in a virtual tree
for the SAME class. Only the first feature partitions the sample so that one of the
partitions contains a predominance of examples that belong to a single category, and this
category is not associated with the SAME class. Considering this group of consumers
(i.e., people employed by the military living in a household of size five), recall that this
was a time period witnessing major decreases in defense spending. This information
suggests that this feature may not be a likely description of consumers employed by the
military who tend to answer our question with 'better off' during a recessionary episode.
The other two features partition the sample so that the number of examples in each
partition is about even, even though this is a case where the unsure respondents have
been combined with 'better'.


Table 6.16  Other 'better' and unsure consumer-descriptions

Unique Features for FB(BDK)SW: Class=SAME

                             #              feature    FUTFIN
feature                      obs     %      edges      better       same         worse

HOSIZE=5,                    36      0.6    3          18 (50%)     10 (28%)     8 (22%)
OCCUPA=MilitaryorMisc.

OCCUPA=Sales,                241     3.8    21         124 (52%)    99 (41%)     18 (7%)
SEX=Female,
TIME=90Q2.

EDUCAT=HighSchoolGrad,       64      1.0    8          33 (52%)     27 (42%)     4 (6%)
HOSIZE=3,
PID=Independent.

Regarding buying plans for these groups of consumers, Table 6.17 shows
distributions of replies for the three unique features appearing in the SAME virtual tree.


Table 6.17  Other consumer descriptions regarding buying plans


OCCUPA=MilitaryorMisc. && HOSIZE=5

                       GBTIME
FUTFIN    Good         Uncert.      Bad          Don't Know
Better    7 (39%)      2 (11%)      6 (33%)      3 (17%)
Same      6 (60%)      1 (10%)      2 (20%)      1 (10%)
Worse     3 (38%)      1 (12%)      4 (50%)      0 (0%)

                       PLANBC
FUTFIN    Yes          Maybe        No           Don't Know
Better    3 (17%)      1 (6%)       14 (78%)     0 (0%)
Same      1 (10%)      1 (10%)      8 (80%)      0 (0%)
Worse     1 (12%)      2 (25%)      5 (63%)      0 (0%)

                       TIME
FUTFIN    90Q2         90Q3         90Q4         91Q1         91Q2
Better    2 (11%)      2 (11%)      4 (22%)      4 (22%)      6 (34%)
Same      2 (20%)      0 (0%)       2 (20%)      3 (30%)      3 (30%)
Worse     2 (25%)      1 (13%)      3 (37%)      2 (25%)      0 (0%)


OCCUPA=Sales && SEX=Female && TIME=90Q2

                       GBTIME
FUTFIN    Good         Uncert.      Bad          Don't Know
Better    67 (54%)     15 (12%)     25 (20%)     17 (14%)
Same      57 (58%)     10 (10%)     23 (23%)     9 (9%)
Worse     11 (61%)     0 (0%)       5 (28%)      2 (11%)

                       PLANBC
FUTFIN    Yes          Maybe        No           Don't Know
Better    16 (13%)     7 (6%)       100 (81%)    1 (<1%)
Same      9 (9%)       4 (4%)       86 (87%)     0 (0%)
Worse     2 (11%)      0 (0%)       16 (89%)     0 (0%)

                       TIME
FUTFIN    90Q2         90Q3         90Q4         91Q1         91Q2
Better    124 (100%)   0 (0%)       0 (0%)       0 (0%)       0 (0%)
Same      99 (100%)    0 (0%)       0 (0%)       0 (0%)       0 (0%)
Worse     18 (100%)    0 (0%)       0 (0%)       0 (0%)       0 (0%)


EDUCAT=HighSchoolGrad && HOSIZE=3 && PID=Independent

                       GBTIME
FUTFIN    Good         Uncert.      Bad          Don't Know
Better    14 (42%)     3 (9%)       15 (46%)     1 (3%)
Same      10 (37%)     5 (19%)      10 (37%)     2 (7%)
Worse     0 (0%)       0 (0%)       4 (100%)     0 (0%)

                       PLANBC
FUTFIN    Yes          Maybe        No           Don't Know
Better    3 (9%)       0 (0%)       30 (91%)     0 (0%)
Same      6 (22%)      1 (4%)       20 (74%)     0 (0%)
Worse     1 (25%)      0 (0%)       3 (75%)      0 (0%)

                       TIME
FUTFIN    90Q2         90Q3         90Q4         91Q1         91Q2
Better    5 (15%)      8 (24%)      12 (37%)     8 (24%)      0 (0%)
Same      3 (11%)      6 (22%)      13 (48%)     5 (19%)      0 (0%)
Worse     0 (0%)       2 (50%)      0 (0%)       2 (50%)      0 (0%)

The third feature is the only one that does a fair job of discriminating between consumers
having definite and indefinite plans for buying a car. Considering the first feature, only
39% of them thought that it was a good time to buy things, even though 50% of this
group of consumers felt 'better off'.

After considering the decision trees represented by fbb(sdk)w and fb(bdk)sw, it
appears that fbb(sdk)w represents the more plausible hypothesis for the target concept.
This implies that, during our time period, demographic descriptions of unsure respondents
regarding their future financial expectations more closely resemble descriptions of
respondents who replied 'the same' than any of the other alternative replies. This result
does not support our hypothesis regarding demographic descriptions for consumers during
the recession. However, due to the relatively low number of 'don't know' respondents
in the sample and because of the inherent difficulties involved when comparing decision
trees that assign the same objects to a different number of classes, our result may not
represent a general condition.

6.5.3    Describing  Consumer  Buying  Plans 

An additional hypothesis proposed in Chapter 1 was that changes occurring among
certain types of consumers' attitudes best explain the change in consumer unwillingness
to buy automobiles during the recessionary and recovery periods. We suggested using
respondent replies to the question on the survey asking if anyone was planning on buying
a car or truck--PLANBC. The independent variables included CURFIN, FUTFIN,
USFUFI, USNEX5, and GBTIME. Our goal was to examine descriptions of consumer
intentions to purchase cars; however, we encountered several problems when attempting
to use this approach. The first problem was that, of the 6376 observations in the original
data-set, 194 or 3% of them had missing information for PLANBC. To resolve this
problem we simply removed the observations containing missing information. For the
remaining 6182 observations, the respondent replies were distributed as 11.9%, 5.1%,
82.6%, and 0.4%, for 'yes', 'maybe', 'no', and 'don't know' respectively. Given this,
note that a decision tree consisting of a single leaf node labeled NO misclassifies only
17.4% of the items.
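
The 17.4% figure is simply the combined share of every reply other than the most frequent one. A minimal Python sketch of that arithmetic follows; the function name `majority_baseline` is a hypothetical label used only here, and the percentages are the ones quoted above for the 6182 remaining observations.

```python
from typing import Dict, Tuple


def majority_baseline(shares: Dict[str, float]) -> Tuple[str, float]:
    """Return the label of a single-leaf tree and its misclassification share.

    A single leaf predicts the most frequent class, so its error is the
    total share of all remaining classes.
    """
    label = max(shares, key=shares.get)
    error = sum(share for cls, share in shares.items() if cls != label)
    return label, error


# PLANBC reply distribution (percent) quoted in the text.
planbc = {"yes": 11.9, "maybe": 5.1, "no": 82.6, "don't know": 0.4}

leaf_label, leaf_error = majority_baseline(planbc)
```

Any learned tree for PLANBC therefore has to beat an error of about 17.4% before it says anything beyond "most respondents answered no".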

An  additional  problem  is  that  consumers  may  not  be  considering  the  purchase  of 
an  automobile  for  reasons  unrelated  to  the  factors  we  are  examining.  For  example, 
suppose  every  adult  household-member  owns  a  car  that  is  less  than  say  three  years  old. 
This  type  of  consumer-household  may  not  consider  buying  a  car  simply  because  one  is 
not  needed.  Also,  there  are  no  questions  on  the  BEBR  survey  that  allow  us  to  separate 
the  'no'  respondents  into  groups  that,  for  example,  have  no  need  for  purchasing  a  car, 
and,  those  who  may  be  waiting  on  business  conditions  to  become  more  favorable  for 
buying  a  car.  Hence,  it  is  apparent  that  we  need  additional  information  to  further 
examine  this  hypothesis. 

We instead chose to examine certain types of descriptions for consumer attitudes
regarding whether or not it is a good time to buy major household items. This serves as
an additional and final illustration of how our empirical tool can be used. We modify our
objective of Chapter 1 so that we now want to examine buying plans for major household
items (i.e., GBTIME) in terms of consumer expectations for personal finances and the
national economy. Keeping the 'theme' of our original hypothesis, we now propose that
changes taking place in the 'don't know' category of consumer expectations for personal
finances and the national economy best explain consumer views regarding the purchase
of major household items during the recent recession. To test our hypothesis, we used
two sets of independent attributes and determined which set performed best with respect
to decision trees constructed by C4.5. The dependent attribute, GBTIME, is distributed
in our training and test sets as shown in Table 6.18.

The first set of independent attributes includes: (1) CURFIN, (2) FUTFIN, (3)
USFUFI, (4) USNEX5, and (5) TIME. These attributes represent the BEBR
index-component-questions described in Chapter 1. Using these attributes, a decision tree
containing DUALTREE features offers descriptions of consumer buying plans in terms
of certain consumer expectations during our time period. The independent attributes for
the second set are (1) FUTFIN, (2) AGE, (3) INCOME, (4) SEX, and (5) TIME. Using
these attributes in a decision tree produces a relationship between consumer buying plans
and the expectational and demographic factors of consumers. Essentially, in forming the
second set, we replaced the 'expectational' attributes of consumers with widely used
demographic attributes for studying consumer buying plans. The respective fn's
associated with the two sets are gbtime and buyfais. Table 6.19 shows C4.5 results using
our two sets of attributes along with other results for the features formed from them.

Table 6.18  Sample distributions using GBTIME

TRAINING SET -- 1990 - 1991  (5100 observations)

GBTIME         Frequency    Percent
Good           2315         45.4
Uncertain      656          12.9
Bad            1762         34.5
Don't Know     367          7.2

TEST SET -- 1990 - 1991  (1276 observations)

GBTIME         Frequency    Percent
Good           590          46.2
Uncertain      152          11.9
Bad            436          34.2
Don't Know     98           7.7


Considering the initial trees, the trees represented by fbgbtime and fbbuyfais both
have a rank of five, and fbgbtime has the lower misclassification rate of the two: 42.8%
vs. 47.2%. Hence, fbgbtime does a better job of describing the data. Note also that,
using DUALTREE features, we were able to improve performance in each case.

Table 6.19 also shows that C4.5 used 44% of the 25 features formed by
DUALTREE in the tree represented by fbgbtime, and 12.5% of the edges in the tree
contain a new feature as a predecessor or successor node. Additionally, the confusion
matrices of Table 6.20 show that this tree correctly classifies 50.4% of the unseen items.
Hence, our method of feature construction gives reasonable results for learning this
relatively difficult concept.


Table 6.19  More C4.5 results for the BEBR data

                     #      Before Pruning                  After Pruning
fn          # atts   obs    size   Errors          rank     size   Errors          Est      rank

gbtime      5        5100   707    2161 (42.4%)             51     2456 (48.2%)    50.6%
bgbtime     21       5100   619    2177 (42.7%)    5        109    2386 (46.8%)    49.7%    3
fbgbtime    46       5100   611*   2183 (42.8%)    5        121    2373 (46.5%)*   49.8%    3

buyfais     5        5100   389    2403 (47.1%)             47     2591 (50.8%)    53.4%
bbuyfais    18       5100   409    2417 (47.4%)    5        57     2577 (50.5%)    52.8%    3
fbbuyfais   31       5100   433    2406 (47.2%)*   5        59     2564 (50.3%)*   52.7%    3

'f'<func-name> => C4.5 using features formed with Initial Tree
'b'<func-name> => C4.5 uses 'Binarized' files
BOLDFACE* => 'Improvement' in performance

            # primes   # features   # features   max size   max size   # edges
fn          used       formed       used         formed     used       with feature

fbgbtime    21         25           11           5          3          76
fbbuyfais   18         13           4            4          3          45


Thus, considering demographic and 'expectational' descriptions for consumers, it
appears that consumer views on purchasing major household items are better explained
by their expectations of personal finances and conditions for the national economy. We
end this section by showing a list of features formed by DUALTREE that were
subsequently used by C4.5 to build the trees represented by fbgbtime and fbbuyfais.
Table 6.21 shows the list of features. Observe that many of the features on the list
represent descriptions of consumers who were unsure about their expectations regarding
certain events. Hence, considering the number of features formed including an
attribute-value of 'don't know' or 'uncertain', we can say, for example, that meaningful
descriptions of consumer buying plans for major household items include their uncertainty
regarding certain events.


Table 6.20  Confusion matrices for consumer views on buying

Confusion Matrices using C4.5 Simplified Trees

fbgbtime

(a)    (b)    (c)    (d)    <==classified as
437    6      145    2      (a): class GOOD
92     8      52     0      (b): class UNCERTAIN
233    4      198    1      (c): class BAD
67     1      30     0      (d): class DONT KNOW

50.4% of 1276 items correctly classified

fbbuyfais

(a)    (b)    (c)    (d)    <==classified as
440    1      147    2      (a): class GOOD
110    0      41     1      (b): class UNCERTAIN
254    0      182    0      (c): class BAD
67     0      29     2      (d): class DONT KNOW

48.9% of 1276 items correctly classified
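
The percentages reported beneath the confusion matrices of Table 6.20 can be recomputed from the cell counts. The Python sketch below does so; the matrices are copied from Table 6.20 (rows are true classes, columns are predicted classes), while the function names are assumptions introduced only for this illustration.

```python
from typing import List

Matrix = List[List[int]]


def accuracy(matrix: Matrix) -> float:
    """Share of items on the diagonal, i.e. correctly classified items."""
    correct = sum(matrix[k][k] for k in range(len(matrix)))
    total = sum(sum(row) for row in matrix)
    return correct / total


def per_class_recall(matrix: Matrix) -> List[float]:
    """For each true class, the share of its items assigned to that class."""
    return [row[k] / sum(row) for k, row in enumerate(matrix)]


# Rows and columns are ordered GOOD, UNCERTAIN, BAD, DONT KNOW (Table 6.20).
fbgbtime = [
    [437, 6, 145, 2],
    [92, 8, 52, 0],
    [233, 4, 198, 1],
    [67, 1, 30, 0],
]
fbbuyfais = [
    [440, 1, 147, 2],
    [110, 0, 41, 1],
    [254, 0, 182, 0],
    [67, 0, 29, 2],
]

acc_gbtime = round(100 * accuracy(fbgbtime), 1)
acc_buyfais = round(100 * accuracy(fbbuyfais), 1)
recall_gbtime = per_class_recall(fbgbtime)
```

The per-class view makes visible something the overall percentages hide: both trees place most UNCERTAIN and DONT KNOW items in the GOOD or BAD classes.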

6.6   Complexity  Results 

We  conclude  this  chapter  by  examining  our  results  with  respect  to  the  time 
complexity  model  we  described  in  Chapter  4.  Recall  also  from  Chapter  5  that  the  time 
required  by  DUALTREE  to  form  'j'  features  is  proportional  to  (2V  -  1)  if  j  «  C(V  -  1), 


215 
Table  6.21  A  list  of  consumer  descriptions 

Features  used  in  'fbgbtime' 

1.  USFUn  =  Don't  Know   &&   TIME  =  90Q2 

2.  USFUFI  =  Uncertain  &&  TIME  =  90Q3 

3.  CURFIN  =  Worse  &&  FUTFIN  =  Same   &&  TIME  =  90Q4 

4.  CURFIN  =  Better  &&  USNEX5  =  Uncertain 

5.  USNEX5  =  Good  &&  TIME  =  91Q1 

6.  CURFIN  =  Don't  Know  &&  USFUH  =  Uncertain  8l&.  TIME  =  91Q2 

7.  FUTFIN  =  Don't  Know  &&  TIME  =  91Q2 

8.  CURFIN  =  Don't  Know   &&  FUTFIN  =  Worse 

9.  CURFIN  =  Better  &&  USNEX5  =  Uncertain  &&  TIME  =  90Q4 

10.  CURFIN  =  Same   &&  USFUH  =  Good 

11.  USFUFI  =  Uncertain  &&  USNEX5  =  Bad 

Features  used  in  'fbbuyfais' 

1.  FUTFIN  =  Worse  &&  AGE  >65  yrs  &&  TIME  =  90Q3 

2.  FUTFIN  =  Don't  Know  &&  AGE  >  65  yrs 

3.  INCOME  <  $25K   &&   SEX  =  Female 

4.  INCOME  =  $45K  -  $75K   &&   SEX  =  Female 


216 
where  V  is  the  number  of  vertices  or  nodes  and  C  is  the  number  of  classes  in  an  input 

decision  tree.    Otherwise,  the  time  necessary  for  feature  construction  is  proportional  to 

(CV  +  j).    Note  that  in  all  of  our  runs  the  number  of  features  formed  was  less  than  the 

number  of  vertices  appearing  in  the  decision  tree  used  by  DUALTREE  to  form  the 

features.  Hence,  we  can  say,  for  example,  that  j  «  C(V  -1)  because  j  is  less  than  V  and 

C  is  at  least  two.    This  means  that,  for  our  experiments,  the  time  used  for  forming  j 

features  is  proportional  to  (2V  -  1 ).  Given  this  information,  a  time  complexity  model  for 

our  learning  system,  given  some  proportionality  constant  a,  is: 

(n+lf   >   (n+l+jf '  +  a(2V-  1) 

where  n  is  the  number  of  primitive  attributes  and  r  is  the  rank  of  the  tree  over  the  n 

primitives.     Given  a  decision  tree  constructed  using  j  new  features,  i  represents  the 

amount  of  rank  reduction  attributed  to  the  tree.    For  example,  if  the  original  tree  using 

the  primitive  attributes  has  a  rank  of  three,  and  a  tree  constructed  using  features  and  the 

primitives has a rank of one, then we reduce the rank by two; hence, a rank reduction of

two  is  associated  with  the  trees  of  this  example.    The  constant  of  proportionality,  a,  is 

determined  by  certain  criteria  related  to  the  implementation  of  our  algorithm  for  forming 

a  finite  number  of  features. 
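As a rough illustration of how the terms of this model can be evaluated, the following C++ sketch computes (n+1)^{2r}, (n+1+j)^{2(r-i)}, and 2V - 1 for one pair of trees. The names ModelTerms, modelTerms, and modelPredictsSaving are hypothetical and are not part of DUALTREE; the value of the constant a is likewise an assumption supplied by the caller.

```cpp
#include <cassert>
#include <cmath>

// Hypothetical helper names; only the formula itself comes from the text.
struct ModelTerms {
    double before;    // (n+1)^(2r): tree construction over the n primitives
    double after;     // (n+1+j)^(2(r-i)): tree construction over primitives plus j features
    double features;  // 2V-1: feature-construction term, before scaling by a
};

ModelTerms modelTerms(int n, int r, int j, int i, int V) {
    assert(n >= 0 && r >= 0 && j >= 0 && V >= 1);
    assert(r - i >= 0);  // the rank of the final tree, r-i, cannot be negative
    ModelTerms t;
    t.before = std::pow(n + 1.0, 2.0 * r);
    t.after = std::pow(n + 1.0 + j, 2.0 * (r - i));
    t.features = 2.0 * V - 1.0;
    return t;
}

// True when (n+1)^(2r) > (n+1+j)^(2(r-i)) + a(2V-1) for the given constant a.
bool modelPredictsSaving(const ModelTerms& t, double a) {
    return t.before > t.after + a * t.features;
}
```

For example, calling modelTerms(32, 5, 17, 1, 185) with the values listed for mx11 in Table 6.22 gives a feature-construction term of 369 and a left-hand side larger than the right-hand side for a = 1, whereas the values listed for dnf3 (i = -1) do not satisfy the inequality.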

Recall  that  a  key  assumption  of  our  model  is  that  we  are  able  to  reduce  the  rank 

of  a  tree.    In  many  of  the  cases  we  were  unable  to  reduce  the  rank  of  the  original  tree. 

However,  in  some  of  those  cases,  we  were  able  to  improve  performance  according  to 

other criteria. Considering the complexity model for our experiments, Table 6.22 shows

the  values  for  each  component  of  the  model  associated  with  every  pair  of  decision  trees 


described  in  this  chapter  (i.e.,  one  before  and  one  after  feature  construction).    Note  that 

we list the filename (fn) of the tree before feature construction in the table.

Observe  from  the  table  that  we  gained  some  improvement  on  the  order  of  time, 

according  to  our  model  and  the  tree-construction-component,  in  eight  of  the  29  cases.  In 

one  case  we  see  that  we  produced  a  tree  having  a  higher  rank  than  the  original  tree  (i.e., 

i  =  -1).   For  this  case,  we  can  say,  for  example,  that  there  is  a  substantial  cost  associated 

with  building  the  tree  represented  by  fdnf3,  using  the  41  features  formed  from  the  tree 

represented  by  dnf3.    Also,  observe  that  in  three  of  the  cases,  bbswdk,  bb(sdk)w,  and, 

bbs(wdk),  even  though  we  were  able  to  reduce  the  rank  of  the  original  tree,  we  did  not 

realize  any  'savings'  associated  with  these  cases.     These  cases  appear  to  represent 

conditions unfavorable for using a second-order approximation for our function as

described  in  Chapter  4.    Recall  that  we  used  a  second-degree  Taylor  polynomial  as  an 

approximation  to  our  function.   This  assumes  that  the  numerical  value  of  the  remainder 

after  using  three  terms  in  our  approximation  is  relatively  small.    Hence,  for  these  cases, 

the  error  involved  in  using  our  polynomial  approximation  appears  to  be  considerable. 

Finally,  note  that  the  time  required  to  form  j  features,  which  is  proportional  to  (2V  -  1), 

is of a much lower order than the time it takes to build a tree. This is a characteristic

of  a  practical  feature-construction  algorithm. 




Table 6.22  Complexity results

fn           n     r   V      j     i    (n+1)^{2r}   (n+1+j)^{2(r-i)}   2V-1
dnf1         80    5   467    76    0    1.2E19       9.1E21             933
bdnf1        160   5   465    107   0    1.2E22       1.9E24             929
dnf2         40    4   301    43    0    8.0E12       2.5E15             601
bdnf2        80    4   301    59    0    1.9E15       1.5E17             601
dnf3         32    3   137    41    -1   1.3E9        9.0E14             273
bdnf3        64    3   137    33    0    7.5E10       8.9E11             273
dnf4         64    6   593    81    0    5.7E21       9.4E25             1185
bdnf4        128   6   593    54    0    2.1E25       1.4E27             1185
mx6          16    3   43     3     0    2.4E7        6.4E7              85
bmx6         32    3   43     9     0    1.2E9        5.5E9              85
mx11         32    5   185    17    1    1.5E15       3.9E13             369
bmx11        62    5   185    27    2    9.9E17       5.3E11             369
par4         16    7   591    18    2    1.7E17       2.6E15             1181
bpar4        32    7   591    40    2    1.8E21       4.3E18             1181
par5         32    8   1705   46    2    2.0E24       5.9E22             3409
bpar5        64    8   1711   57    3    1.0E29       7.3E20             3421
bmonks1      17    1   7      2     0    324          400                13
bmonks2      17    4   113    5     1    1.1E10       1.5E8              225
bmonks3      17    1   9      2     0    324          400                17
bmushroom    125   2   23     9     0    2.5E8        3.3E8              45
btictactoe   27    4   101    29    1    3.8E11       3.4E10             201
bcrx         47    3   117    31    0    1.2E10       2.4E11             233
bglass       67    3   67     20    0    9.9E10       4.6E11             133
bbswdk       48    6   2187   188   1    1.9E20       5.6E23             4373
b(bdk)sw     48    6   2185   73    0    1.9E20       1.1E25             4369
bb(sdk)w     48    6   2157   168   1    1.9E20       2.3E23             4313
bbs(wdk)     48    6   2261   83    1    1.9E20       1.6E21             4521
bgbtime      21    5   619    25    0    2.7E13       5.3E16             1237
bbuyfais     18    5   409    18    0    6.1E12       4.8E12             817

xEy  ==>  x × 10^y

CHAPTER  7 
CONCLUSIONS 


In  this  final  chapter,  we  begin  with  a  brief  summary  of  this  dissertation.  Next  we 
reiterate  our  objectives  and  state  how  they  were  achieved.  Finally,  we  identify  open 
research  problems. 

7.1    Summary 

Econometric models based on traditional statistical techniques are poorly suited
for exploring certain questions regarding the 1990-1991 recession. Furthermore, the common
practice for computing a balance score when constructing a consumer confidence index
from business surveys groups 'don't know' responses to survey questions with the
'same' responses. For the combined group, this practice may mask key
information associated with the individual responses. Hence, existing models that apply
statistical techniques to business survey data are not always helpful for
exploring questions regarding certain consumer attitudes and expectations during the
recent recessionary period.

This dissertation proposed the development of an empirical model of learning
using Artificial Intelligence (AI) methods and techniques. AI-based knowledge-acquisition
procedures usually examine examples of solved cases and give general
decision rules in terms of a pre-defined structure. Models based on AI methods offer the
advantage of being able to examine the relationships among a fairly large set of
attributes. Other advantages of models based on AI methods are that (1) they are capable
of working with quantitatively and qualitatively scaled variables, and (2) they may overcome
certain limitations associated with models based on statistical methods.

Because  decision  trees  are  currently  the  most  highly  developed  AI  technique  for 
partitioning  a  sample  into  a  set  of  covering  decision  rules,  we  used  decision  trees  as  the 
pre-defined  structure  in  our  empirical  model.  A  decision  tree  is  a  structure  consisting  of 
nodes  and  branches  where  each  node  represents  a  test  or  decision.  Decision  trees 
constructed  using  solely  a  set  of  primitive  attributes  may  have  several  shortcomings. 
These  include  (1)  human  beings  may  not  be  able  to  easily  interpret  a  tree  due  to  its 
complexity  and  inscrutability,  (2)  a  tree  may  fragment  individual  subconcepts,  and,  (3) 
there  are  relatively  few  approaches  for  using  a  decision  tree  to  correlate  certain  'bits'  of 
information  among  the  set  of  primitives  (i.e.,  develop  more  'task-specific'  attributes). 

This  dissertation  proposed  feature  construction  as  a  potential  solution  to  the 
problem.  Feature  construction  is  a  technique  for  creating  new  features  which  are 
combinations  of  the  primitive  attributes.  This  'representation  change'  has  the  effect  of 
altering the form of the learning problem's instance space. Learning algorithms that
incorporate some form of feature construction when constructing a decision tree may
produce 'better' trees than those created using only the primitive attributes.
Feature construction can improve hypothesis accuracy, conciseness, and learning
efficiency. Another advantage of feature construction is that decision trees constructed
using the primitive attributes and a set of features formed from the primitives may not
possess duplications of decision-sequences or patterns.
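To make the notion of a constructed feature concrete, the following C++ sketch represents a feature as a conjunction of attribute-value tests and evaluates it on one example. The type names Example and Feature, the function holds, and the string encoding of attribute values are assumptions made for illustration; they do not describe how DUALTREE stores its features.

```cpp
#include <cassert>
#include <map>
#include <string>
#include <utility>
#include <vector>

// One survey record: attribute name -> nominal value (names follow Appendix A).
using Example = std::map<std::string, std::string>;

// A constructed feature: a conjunction of (attribute, value) tests.
using Feature = std::vector<std::pair<std::string, std::string>>;

// Returns true only if every test in the conjunction holds for the example.
bool holds(const Feature& f, const Example& e) {
    assert(!f.empty());  // this sketch treats an empty conjunction as invalid
    for (const auto& test : f) {
        auto it = e.find(test.first);
        if (it == e.end() || it->second != test.second) return false;
    }
    return true;
}
```

Under these assumptions, a feature such as INCOME < $25K && SEX = Female is written as the two-element list {{"INCOME", "<$25K"}, {"SEX", "Female"}}, and a tree-building program can then treat the Boolean result of holds as one additional binary attribute.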

Current feature-construction algorithms found in the literature typically form
features using a decision tree and rely on heuristic approaches such as
choosing attributes for forming new features based on the attributes' relative
positions in a tree. These algorithms also typically build several decision trees when
producing a finite number of new features. Thus, in terms of time complexity,
current feature-construction algorithms require a large amount of
time to process the examples.

To ameliorate these problems, we first established a time-complexity model of
feature construction consisting of a tree-construction-factor and a feature-construction-factor.
Next, Chapter 4 describes our approach to feature construction, DUALTREE,
which is based on formal theories found in the literature (i.e., category theory, the theory
of finite state machines, and Boolean algebra). DUALTREE forms a finite number of
features using the 'dual' of a decision tree. Essentially, our procedure consists of four
steps and does not 'iterate' by building a new decision tree at interim steps in the process.
Thus, DUALTREE represents a new approach to feature construction.

Chapter  5  describes  our  algorithm.  We  designed  DUALTREE  based  on  the 
procedure described in Chapter 4 and implemented it using the C++ programming
language.  DUALTREE  is  a  practical  algorithm  for  forming  a  finite  number  of  features 
since  it  does  not  require  a  large  amount  of  time  or  storage  space  to  form  the  features. 


DUALTREE's  features  were  tested  in  depth  on  several  Boolean  functions  as  well 

as  classification  problems  examined  by  other  researchers.    Chapter  6  shows  that  for  a 

given 'goodness of split' measure, namely the gain ratio criterion, DUALTREE forms

useful  features;  useful  in  the  sense  that  these  features  are  likely  to  be  chosen  by  the 

heuristics  used  to  build  a  decision  tree.    The  chapter  also  shows  that  DUALTREE's 

features  aid  in  building  better  trees  with  regard  to  the  size,  rank,  misclassification  rate, 

and  predictive  accuracy  for  the  tree.   A  series  of  experiments  using  the  BEBR  business 

survey data showed that our empirical tool can be used to create and examine certain

types  of  descriptions  for  consumer-households  during  the  recent  recession. 
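For readers unfamiliar with the gain ratio criterion mentioned above, the following C++ sketch computes it from a table of class counts per branch, using the usual definition of information gain divided by split information. The function names and the count-table layout are assumptions made for illustration and are not taken from the C4.5 or DUALTREE source code.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Entropy (in bits) of a vector of non-negative counts.
double entropy(const std::vector<double>& counts) {
    double total = 0.0;
    for (double c : counts) total += c;
    if (total <= 0.0) return 0.0;
    double h = 0.0;
    for (double c : counts) {
        if (c > 0.0) {
            double p = c / total;
            h -= p * std::log2(p);
        }
    }
    return h;
}

// counts[b][c] = number of examples sent down branch b that belong to class c.
// Returns information gain divided by split information (0 if the split is trivial).
double gainRatio(const std::vector<std::vector<double>>& counts) {
    assert(!counts.empty());
    std::vector<double> classTotals(counts[0].size(), 0.0);
    std::vector<double> branchTotals;
    double total = 0.0;
    for (const auto& branch : counts) {
        assert(branch.size() == classTotals.size());
        double b = 0.0;
        for (std::size_t c = 0; c < branch.size(); ++c) {
            classTotals[c] += branch[c];
            b += branch[c];
        }
        branchTotals.push_back(b);
        total += b;
    }
    if (total <= 0.0) return 0.0;
    double remainder = 0.0;  // expected class entropy after the split
    for (std::size_t b = 0; b < counts.size(); ++b)
        remainder += branchTotals[b] / total * entropy(counts[b]);
    double gain = entropy(classTotals) - remainder;
    double splitInfo = entropy(branchTotals);
    return splitInfo > 0.0 ? gain / splitInfo : 0.0;
}
```

A binary feature that separates two equally frequent classes perfectly yields a gain ratio of one, while a feature whose branches have the same class mix as the whole sample yields zero; a 'useful' feature in the sense above is one that scores well enough on this measure to be selected.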

7.2   Attainment  of  Goals 

In the following sections, we first reiterate the theses and objectives set forth in
Chapter 1. This is followed by a summary of how we accomplished each objective.

7.2.1    Problem  Definition 

Thesis: We can build a decision tree within the standard time complexity by adding at
most j new and useful features to an existing set of features. Useful features are features
that are likely to be selected by the heuristics used to build the decision tree.

Objective:  Define  'feature  construction'  and  develop  an  analytical  model 
showing  how  the  temporal  behavior  of  an  algorithm  changes  as  we  add  additional 
features. 


Accomplishment:    In  Chapter  4,  we  defined  feature  construction  as  a  technique 

for  creating  new  features  which  are  combinations  of  a  set  of  primitive  attributes. 

This  is  a  type  of  representation  change,  where  each  term  or  feature  is  a  conjunct 

of  the  primitive   attributes.      Chapter  4  also  presents  our  analytical   model 

representing  the  time  complexity  of  an  algorithm  that  includes  some  form  of 

feature construction. Our model separates the time complexity into two

components: a tree-construction component and a feature-construction

component. Using our components, we show how the time complexity behaves

in  terms  of  (1)  the  rank  of  a  decision  tree,  and,  (2)  the  number  of  features  formed 

and  added  to  an  existing  set  of  primitive  attributes. 

Objective:        Identify    the    difficulties    and    conditions    for    improving    the 

computational  efficiency  of  feature  construction. 

Accomplishment: Our analysis is essentially based on a comparison of two

decision trees, one before and one after a finite number of features are added to

the  set  of  attributes  used  to  build  the  trees.    We  focused  on  the  two  components 

of  our  model  and  studied  the  characteristics  of  each  of  them.    Considering  the 

tree-construction  component,  we  established  a  lemma  that  gives  a  relationship 

between  the  initial  and  final  ranks  of  two  trees,  and  the  number  of  features  added 

to the feature-set. From the lemma, we find that the difference between the ranks

of  the  initial  and  final  trees,  limits  the  number  of  features  useful  for  reducing  the 

rank  of  a  decision  tree.    For  the  feature-construction  component,  we  presented  a 

practical  procedure  for  forming  a  limited  number  of  features  using  a  binary 


decision  tree.     Our  procedure  is  practical  because  the  time  required  to  form 

features:  (1)  does  not  depend  on  the  sample  size,  and,  (2)  is  linear  in  terms  of 

parameters  for  a  given  tree. 

Objective:    Establish  general  conditions  that  must  be  satisfied  by  any  approach 

for  resolving  the  problem. 

Accomplishment:      Our  model  suggests  two  general  characteristics  for  any 

computationally  efficient  or  practical  feature-construction-algorithm.    One  is  that 

the  time  required  by  the  algorithm  to  form  a  limited  number  of  new  features  is 

much less than the time it takes to build a new decision tree using the features. The

other  characteristic  is  that  the  limited  number  of  new  features  formed  by  the 

algorithm  aid  in  building  a  decision  tree  whose  rank  is  smaller  than  the  rank  of 

the  initial  tree  used  to  form  the  new  features.   An  additional  characteristic  is  that, 

using  a  practical  algorithm,  the  difference  in  the  ranks  of  the  initial  and  final  trees, 

and,  the  number  of  new  features  formed,  satisfy  the  conditions  set  forth  in  Lemma 

4.1. 


7.2.2    Problem  Resolution 

Thesis: The DUALTREE procedure for forming useful features produces feature-sets of
lower cardinality. The features are "useful" in the sense that those it creates and adds to the set
of primitive attributes from which they were formed are likely to be used by
tree-construction-heuristics (i.e., given two features having high information gain for a subset
of the instances, the gain-ratio criterion selects the one giving the higher proportion of
split-information), and the feature-set contains a 'minimal' number of features.

Objective:  Identify  the  range  of  possible  methods  for  forming  features. 
Accomplishment:  Chapter  4  describes  two  approaches  for  forming  new  features. 
Focusing  on  the  problem  of  searching  the  space  of  all  possible  feature-sets,  we 
expressed  the  time  complexity  result  in  terms  of  the  number  of  new  features 
formed.  Considering  our  representation  for  a  feature,  we  described  approaches  for 
effectively  searching  the  space  of  features  to  find  'useful'  features  for  a  problem. 
Objective:  Use  the  conditions  given  by  the  problem  definition  to  explore  new 
approaches  to  feature  construction  in  a  computationally  efficient  way. 
Accomplishment:  We  identified  several  general  approaches  for  feature 
construction,  and  discussed  the  computational  'cost'  associated  with  each  of  them. 
We also discussed the cardinality of feature-sets formed by each approach. To
ensure completeness, the range of approaches for each aspect was defined using
a simple model in which feature construction processes are related through the
information structure and the information quantity associated with a given data-set.

Objective: Identify a procedure for constructing 'useful' features such that the
time required to construct a new decision tree using these features and features
from a given decision tree is of a lower order than the time used to produce the
given tree.


Accomplishment: Chapter 4 shows how DUALTREE forms a limited number

of  new  features.    Also,  in  general,  the  time  complexity  of  DUALTREE  depends 

on  parameters  of  the  input  decision  tree.    We  also  described  how  DUALTREE 

satisfies  certain  conditions  established  by  our  analytical  model. 

7.2.3    Implementation  and  Experiments 

Thesis:    An  empirical  learning  model  using  Decision  Trees  and  DUALTREE  feature 

construction,  provides  'useful'  descriptions  from  the  BEBR  business  surveys,  allowing  us 

to  test  the  following  hypotheses: 

(1)  Given demographic descriptions for the four categories of responses
representing  the  respondents'  future  financial  condition,  descriptions  for  the  'don't 
know'  category  are  unique,  in  that  they  are  unlike  any  of  the  remaining  three,  to 
a  certain  extent,  and  should  not  be  combined  with  one  of  the  remaining  three  to 
examine  certain  changes  taking  place  during  the  1990  -  1991  recession. 

(2)  Changes  taking  place  in  the  'don't  know'  category  of  consumer 
expectations  for  personal  finances,  buying  plans,  and  the  national  economy,  best 
explain  the  change  in  consumers'  unwillingness  to  purchase  cars.  Since  this 
category  is  not  considered  in  most  methods  for  computing  a  consumer-confidence- 
metric,  this  may  explain  why  consumer  spending  was  over-predicted  for  motor 
vehicle  purchases. 

Objective:   Use  the  procedure  identified  previously  to  encode  an  algorithm  using 

C++. 

Accomplishment:    Chapter  5  describes  a  program  based  on  our  DUALTREE 

feature-construction-framework,   encoded    using   ANSI    Standard    C++.      The 

DUALTREE program essentially processes graph (linked-list) data structures.

We  discussed  the  key  parts  of  our  program  including  an  analysis  of  the  algorithm 

where appropriate. If the number of features formed is much less than the number

of classes times the number of nodes in the original tree, we show that the time

complexity for feature construction is linear in the number of vertices in the

original tree. Otherwise, it is linear in the sum of the number of classes times the

number of vertices and the number of new features formed.

Objective:     Implement  and  empirically  test  two  hypotheses  using  the  BEBR 

sample  data. 

Accomplishment:  DUALTREE  was  extensively  tested  on  Boolean  functions  and 

other  classification  problems.  We  also  described  experiments  using  nominal  and 

continuous  data.    These  experiments  were  designed  to  determine  the  advantages 

and  limitations  of  our  approach;  the  independent  variable  tested  was  essentially 

a  set  of  features.   Using  the  BEBR  sample  data,  we  designed  experiments  to  test 

the  'descriptions'  for  various  categories  suggested  by  DUALTREE's  features. 

Objective:  Analyze  the  results  quantitatively  and  determine  the  relative  worth  of 

the  proposed  methods. 

Accomplishment:    Chapter  6  displayed  and  discussed  our  experimental  results. 

Using  binary,  nominal,  and  continuous  data-sets,  DUALTREE's  features  aid  in  (1) 

simplifying decision trees, and (2) improving a tree's performance for classifying

'unseen'  examples.     Additionally,  in  most  cases,  a  meaningful  number  of  the 

features  formed  by  DUALTREE  appear  in  trees  constructed  using  C4.5. 

Considering the BEBR data, we offered faint evidence showing that demographic

descriptions  of  unsure  respondents  regarding  their  future  financial  expectations  more 

closely  resemble  descriptions  of  respondents  who  replied  'the  same',  than  any  of  the  other 


alternative  replies.     Regarding  our  other  hypothesis,  we  determined  that  we  required 

additional  information  to  fully  examine  it.  We  instead  provided  an  illustration  of  how  our 

tool  may  be  used  by  examining  certain  types  of  consumer-attitudes  regarding  whether  or 

not  it  is  a  good  time  to  purchase  major  household  items.    Our  illustration  revealed  that 

consumer  uncertainty  appears  to  be  a  meaningful  factor  in  descriptions  of  consumer 

attitudes  regarding  certain  events. 

7.3    Future  Research 

The  final  result  of  this  dissertation  is  the  development  of  an  empirical  tool  for 
examining  the  information  found  in  business  survey  data.  In  these  concluding  sections 
we describe several extensions for our approach and offer other research topics worth
pursuing. 

7.3.1    Improved  Model  Development 

In  many  cases,  when  using  DUALTREE's  features,  C4.5  did  not  produce  a 
decision tree having a rank of one. A solution to this may be to use another selection
criterion in addition to the gain-ratio criterion, or to use a completely different heuristic for
partitioning  a  sample.  These  new  heuristics  should  focus  on,  for  example,  selecting 
features  which  aid  in  building  decision  trees  having  a  rank  of  one. 
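As a reminder of what 'rank' means here, the following C++ sketch computes the rank of a binary decision tree using the recursive definition of Ehrenfeucht and Haussler (1989): a leaf has rank zero, and an internal node has the larger of its two subtree ranks when they differ, or that rank plus one when they are equal. The TreeNode type and the helper functions are hypothetical and serve only to illustrate the definition.

```cpp
#include <algorithm>
#include <cassert>
#include <memory>
#include <utility>

// Minimal binary decision tree node; a node with no children is a leaf.
struct TreeNode {
    std::unique_ptr<TreeNode> zero;  // subtree followed when the test is false
    std::unique_ptr<TreeNode> one;   // subtree followed when the test is true
};

// Rank of the tree rooted at t.
int treeRank(const TreeNode& t) {
    if (!t.zero && !t.one) return 0;  // leaf
    assert(t.zero && t.one);          // internal nodes have both children
    int r0 = treeRank(*t.zero);
    int r1 = treeRank(*t.one);
    return r0 == r1 ? r0 + 1 : std::max(r0, r1);
}

// Convenience builders for the sketch.
std::unique_ptr<TreeNode> leaf() { return std::make_unique<TreeNode>(); }

std::unique_ptr<TreeNode> node(std::unique_ptr<TreeNode> z,
                               std::unique_ptr<TreeNode> o) {
    auto n = std::make_unique<TreeNode>();
    n->zero = std::move(z);
    n->one = std::move(o);
    return n;
}
```

Under this definition a tree of rank one is a 'chain' in which every internal node has at least one leaf child, which is why a single conjunctive feature that captures a whole path can lower the rank of a tree.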

Considering the inferences using the BEBR data, various forms of economic data
for the 1990-1991 recession are currently available through several sources. An interesting
study would be to see if, for example, the 1991-1992 data in terms of income, savings,
interest rates, and the like reflect consumers' expectations in the previous year.

7.3.2   Problems  in  the  Study  of  Feature  Construction 

A  key  question  remaining  in  the  problem  of  feature  construction  is  how  to  find 
a  set  of  features  that  aid  in  building  a  decision  tree  having  a  rank  of  one.  We  require  this 
in  order  to  use  certain  research  results  found  in  the  literature.  Ideally,  we  do  not  want 
to  use  the  maximum  number  of  features  available.  Our  problem  may  depend  on  the 
heuristics  used  to  build  a  decision  tree.  There  are  several  attribute  selection  measures 
found in the literature. Recall that FIND(S,r) (Ehrenfeucht and Haussler 1989) focused
on  building  a  binary  decision  tree  having  a  rank  of  one.  This  also  depends  on  the 
heuristics  used  to  build  a  decision  tree.  Lopez  De  Mantaras  (1991)  offers  a  new  attribute 
selection  measure  for  ID3-like  inductive  algorithms.  Based  on  a  'distance'  between 
partitions,  his  measure  offers  certain  advantages  over  ID3-like  measures  such  as  the  gain- 
ratio  measure.  Considering  the  BEBR  data,  it  may  be  worthwhile  to  examine  a  'distance' 
measure,  since  in  essence,  we  are  studying  a  'subjective  distance'  between  the  consumer 
responses  of  'better',  'same',  'worse',  and,  'don't  know'. 
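As a sketch of the kind of 'distance' measure referred to above, the normalized distance between the partition induced by an attribute A and the partition induced by the class C can be written in terms of entropies as d_N = (H(C|A) + H(A|C)) / H(A,C), which equals one minus the information gain divided by the joint entropy. The function names and the count-table layout below are assumptions made for illustration; the sketch is not code from Lopez De Mantaras (1991), C4.5, or DUALTREE.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Entropy in bits of non-negative counts whose sum is 'total'.
double entropyBits(const std::vector<double>& counts, double total) {
    double h = 0.0;
    for (double c : counts) {
        if (c > 0.0) {
            double p = c / total;
            h -= p * std::log2(p);
        }
    }
    return h;
}

// counts[b][c] = number of examples with attribute value b and class c.
// Returns d_N in [0, 1]; smaller values mean the attribute's partition is
// closer to the class partition.
double normalizedPartitionDistance(const std::vector<std::vector<double>>& counts) {
    assert(!counts.empty() && !counts[0].empty());
    std::vector<double> attr(counts.size(), 0.0);
    std::vector<double> cls(counts[0].size(), 0.0);
    std::vector<double> joint;
    double total = 0.0;
    for (std::size_t b = 0; b < counts.size(); ++b) {
        assert(counts[b].size() == cls.size());
        for (std::size_t c = 0; c < cls.size(); ++c) {
            attr[b] += counts[b][c];
            cls[c] += counts[b][c];
            joint.push_back(counts[b][c]);
            total += counts[b][c];
        }
    }
    if (total <= 0.0) return 0.0;
    double hJoint = entropyBits(joint, total);
    if (hJoint <= 0.0) return 0.0;  // both partitions are trivial
    // H(C|A) + H(A|C) = 2 H(A,C) - H(A) - H(C)
    return (2.0 * hJoint - entropyBits(attr, total) - entropyBits(cls, total)) / hJoint;
}
```

An attribute selection heuristic built on this measure would choose the attribute or feature with the smallest distance to the class partition, instead of the largest gain ratio.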


APPENDIX  A 
BEBR  SURVEY  QUESTIONS  AND  VARIABLE  ASSIGNMENTS 


CURFIN  ==>  Current-Personal-Financial  Component

We  are  interested  in  how  people  are  getting  along  financially  these  days.  Would 
you  say  that  you  (and  your  family  living  there)  are  better  off  or  worse  financially 
than  you  were  a  year  ago  ? 


<1>      Better  off

<2>      Same

<3>      Worse  off

<-8>     Don't  know

<-9>     Not  available

FUTFIN  ==>  Future-Personal-Financial  Component 

Now,  looking  ahead  -  do  you  think  that  a  year  from  now  you  (and  your  family 
living  there)  will  be  better  off  financially,  or  worse  off,  or  just  about  the  same  as 
now  ? 


<1>      Better  off

<2>      Same

<3>      Worse  off

<-8>     Don't  know

<-9>     Not  available

USFUFI  ==>  U.S.  1-Year  Component 

Now  turning  to  business  conditions  in  the  country  as  a  whole  —  do  you  think  that 
during  the  next  12  months  we'll  have  good  times  financially,  or  bad  times,  or 
what  ? 

<1>      Good  times 

<2>      Good  with  qualifications 

<3>      Uncertain;  Good  and  Bad 


<4>  Bad  times 

<5>  Bad  with  qualifications 

<-8>  Don't  know 

<-9>  Not  available 

USNEX5  ==>  U.S.  5-Years  Component 

Looking  ahead,  which  would  you  say  is  more  likely  --  that  in  the  country  as  a 
whole  we'll  have  continuous  good  times  during  the  next  five  years  or  so,  or  that 
we  will  have  periods  of  widespread  unemployment  or  depression,  or  what  ? 

<1>  Good  times 

<2>  Good  with  qualifications 

<3>  Uncertain;  Good  and  Bad 

<4>  Bad  times 

<5>  Bad  with  qualifications 

<-8>  Don't  know 

<-9>  Not  available 

GBTIME  ==>  Six-Month-Household  Component 

About  the  big  things  people  buy  for  their  homes  —  such  as  furniture,  a 
refrigerator,  stove,  television,  and  things  like  that.  Generally  speaking,  do  you 
think  now  is  a  good  or  a  bad  time  for  people  to  buy  major  household  items  ? 

<1>  Good  time 

<2>  Uncertain 

<3>  Bad  time 

<-8>  Don't  know 

<-9>  Not  available 


(1)  AGE1    ==>  Broader Age Group of Respondents
                  '18-45'  '45-65'  '>65'

(2)  EDUCAT  ==>  Level of Education
                  '<9'  'HighSchool'  'HighSchoolGrad'  'College'
                  'CollegeGraduate'  'GraduateSchool'

(3)  EMPLOY  ==>  Are You Employed Now?
                  'Yes'  'No'

(4)  HOSIZE  ==>  Household Size
                  '1'  '2'  '3'  '4'  '5'  '>=6'

(5)  INCOME  ==>  Categories of Family Annual Income
                  '< $25000'  '25000 TO 44999'  '45000 TO 74999'
                  '> $75000'

(6)  MARRY   ==>  Marriage Status
                  'Now Married'  'Widowed'  'Never Married'
                  'Divorced/Separated'

(7)  MSA     ==>  'Metropolitan Statistical Area'
                  'Yes'  'No'

(8)  OCCUPA  ==>  Major Categories of Jobs
                  'Farmers'  'Managers'  'Military or Misc'
                  'Professional'  'Sales'

(9)  PID     ==>  Party Affiliation
                  'Republican'  'Democrat'  'Independent'
                  'Other Party'  'No Preference'

(10) RACE    ==>  Racial Background
                  'White'  'Black'  'Oriental'  'Other'

(11) SEX     ==>  Respondent's Sex
                  'Male'  'Female'

(12) TIME    ==>  Calendar time; year and quarter
                  '90Q2'  '90Q3'  '90Q4'  '91Q1'  '91Q2'


APPENDIX  B 
DUALTREE  SOURCE-CODE  LISTINGS 


Newface::Initialize(char  *fn) 

C++  source-code  for  reading  the  input  'edges'  file  produced  by  C4.5 

// Member Function //

void      Newface::Initialize(char *fn)
//
// Parameters
//
// fn is the filename used when executing C4.5
//
// The file name is the file containing the
// edges.
// The program looks for a file having an
// extension of 'edg' for DOS and 'edges' for
// Unix.
{
    if (ReadName(Nf,pred) && ReadName(Nf,outcome)
        && ReadName(Nf,succ))     {
        String& Pred = *(new String(pred));
        String& Eval = *(new String(outcome));
        String& Succ = *(new String(succ));

        x = Pred.hashValue();
        v = Eval.hashValue();
        y = Succ.hashValue();
        bkx.operator=(x);
        xx = HVtreeSearch(bkx);
        // see if Pred in tree
        if (xx != MyNil)    {
            // Pred already in tree
            xt = adj[xx];
            BREAKFLAG = 0;
            while (xt != adjz)     {
                if (xt->edgeval == v) BREAKFLAG = 1;
                xt = xt->next;
            }
            if (BREAKFLAG)   continue;
            // edge already added
        }
        else    {
            // edge not in tree
            anodex = new node;
            anodex->nameval = x;
            anodex->Name = Pred.operator char *();
            anodex->key.operator=(x);
            IDarray[++arrayID] = anodex;
            anodex->Myid = arrayID;
            xx = arrayID;
            HVtreeInsert(bkx, anodex);
        }
        bky.operator=(y);
        aleaf = 0;
        yy = CheckClass(y);
        if (yy > MaxVertices)    aleaf = 1;
        // a leaf
        if (((HVtreeSearch(bky)) == MyNil)
            && !(aleaf)) {
            anodey = new node;
            anodey->nameval = y;
            anodey->Name = Succ.operator char *();
            anodey->Myid     = MyNil;
            anodey->key.operator=(y);
            IDarray[++arrayID] = anodey;
            anodey->Myid = arrayID;
            yy = arrayID;
            HVtreeInsert(bky, anodey);
        }
        else {
            if (!(aleaf)) yy = HVtreeSearch(bky);
        }
        t = new adjnode;
        t->Adjid   = yy;
        t->edgeval = v;
        t->next    = adj[xx];
        adj[xx]    = t;
    } // if readnames

}; // end member function Initialize



Newface::ConstructDual(  ) 

C++  source-code  for  updating  an  array-structure  representing  the  'dual' 
of  a  binary  decision  tree 

// Member Function //

void Newface::ConstructDual(struct adjnode
                            *array[MaxAdjSize])
// Creates the 'dual' of the adjacency matrix
// constructed by the procedure.  This public
// function fills the 'dual' information in the
// array passed to it.
// Parameters
// array is the array-to-fill
{
    for (j = 1; j < k; j++)     {
        if (val[j] == unseen)     {
            DualVisit(array, j);
        }
    }
}; // end member function ConstructDual

// Member Function //

void Newface::DualVisit(struct adjnode
                        *array[MaxAdjSize], int k)
// Traverses the graph and visits all the
// nodes.
// Parameters
// array is the 'dual' adjacency matrix
// k is the index for the matrix
{
    val[k] = ++id;
    for (t = adj[k]; t != adjz; t = t->next)    {
        dualnode = new adjnode;
        dualnode->Adjid   = k;
        dualnode->edgeval = t->edgeval;
        dualnode->next    = array[(t->Adjid)];
        array[(t->Adjid)] = dualnode;
        if (val[t->Adjid] == unseen)
            DualVisit(array, t->Adjid);
    }
}; // end member function DualVisit
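The edge-reversal idea behind ConstructDual and DualVisit can be sketched with standard containers as follows. This is an illustrative restatement under stated assumptions, not the DUALTREE implementation: the Edge and Graph types and the function reverseEdges are hypothetical, and the sketch loops over every vertex directly where the listing above performs a recursive depth-first traversal.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// One labelled edge in an adjacency list: target vertex and branch value.
struct Edge {
    int to;
    int value;  // e.g. 0 or 1 for the outcome of a binary test
};

using Graph = std::vector<std::vector<Edge>>;

// Builds the graph in which every edge u -> v (value e) becomes v -> u (value e).
Graph reverseEdges(const Graph& g) {
    Graph dual(g.size());
    for (std::size_t u = 0; u < g.size(); ++u) {
        for (const Edge& e : g[u]) {
            assert(e.to >= 0 && static_cast<std::size_t>(e.to) < g.size());
            dual[e.to].push_back({static_cast<int>(u), e.value});
        }
    }
    return dual;
}
```

In the reversed graph, following the edges from a class (leaf) vertex leads back through the tests on the path to the root, which is the information needed to collect the attribute-value tests associated with a class.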



Dualtree::BuildClassSucc( )

Source-code for finding the successors for a class node

// Member Function //

void Dualtree::BuildClassSucc(int id)
// Forms the '0' and '1' value successors
// for a given CLASS represented by its numerical
// id number
// Parameters
// id is the numerical number for a given
// CLASS
{
    t = DUALADJ[id]; // predecessor feature
    while (t != DUALz)     {
        anintnode = new intnode;
        anintnode->Adjid = t->Adjid;
        if ((t->edgeval) == one)     {
            // successor for a '1' value
            anintnode->next = OneValList;
            OneValList = anintnode;
        }
        else     {
            anintnode->next = ZeroValList;
            ZeroValList = anintnode;
        }
        t = t->next;         // advance the list pointer
    }



C++ Code for Adding New Features

if (ValList != intz) {                     // not empty
    bitval.operator=(ValHash);
    zz = Hvtreesearch(bitval);             // check tree
    if (zz == MyNil) {                     // feature not in tree
        d = new dualnode;
        d->featureval = GetListHash(ValList);
        d->Feature = ValList;
        zz = FeatureID;
        Hvtreeinsert(bitval, d);
    }
}
else {
    zz = MyNil;                            // id for the 'null-node'
}
dt = new dualadjnode;
dt->DualAdjid = zz;                        // id
dt->edgeval = value;                       // a '0' or '1'
dt->next = dualadj[id];
dualadj[id] = dt;
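
The search-then-insert pattern in the fragment above can be sketched in isolation. This example is an assumption-laden simplification: FeatureTable and findOrInsert are hypothetical names, a std::unordered_map keyed by a string stands in for the hashed search tree behind Hvtreesearch and Hvtreeinsert, and ids are simply handed out in increasing order starting at 1.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <unordered_map>

// Hypothetical feature table: maps a feature's hash key to its integer id.
class FeatureTable {
public:
    // Return the id already stored for `key`; if the key is new, assign
    // the next free id, store it, and return that instead.
    int findOrInsert(const std::string& key) {
        auto it = ids_.find(key);
        if (it != ids_.end()) return it->second;  // feature already known
        int id = nextId_++;
        ids_.emplace(key, id);
        return id;
    }

    std::size_t size() const { return ids_.size(); }

private:
    std::unordered_map<std::string, int> ids_;
    int nextId_ = 1;
};
```

Looking a key up twice yields the same id, so a feature constructed again later is linked to the existing node rather than duplicated.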



Dualtree::BuildDualofDual( )

C++ code for writing the 'dual of the dual' information

// Member Function //

void Dualtree::BuildDualofDual(int k)
// This function builds the 'dual of the dual', represented as an
// adjacency matrix. It recursively traverses the graph.
// Parameters
//   k represents a feature
{
    val[k] = ++dualid;
    for (t = dualadj[k]; t != dualadjz; t = t->next) {
        if (t->DualAdjid != MyNil) {
            dualdualnode = new dualadjnode;
            dualdualnode->DualAdjid = k;
            dualdualnode->edgeval = t->edgeval;
            dualdualnode->next =
                CLASSadj[(theclass - adjcstart + 1)][(t->DualAdjid)];
            CLASSadj[(theclass - adjcstart + 1)][(t->DualAdjid)] = dualdualnode;
            if (val[t->DualAdjid] == unseen)
                BuildDualofDual(t->DualAdjid);
        }
    }
};  // end member function BuildDualofDual



Dualtree::FindTerminals( )

C++ code for creating a list of terminal features

// Member Function //

void Dualtree::FindTerminals(int id)
// This function finds all of the terminal features, or leaves, given a
// Class name represented by its corresponding id number. The function
// recursively calls itself to traverse the adjacency matrix starting
// with the given node.
// Parameters
//   id is the integer associated with a class.
{
    for (pt = dualadj[id]; pt != dualadjz; pt = pt->next) {
        if (pt->DualAdjid != MyNil) {
            if (FeatureIDarray[(pt->DualAdjid)]->isTerminal) {
                if (val[(pt->DualAdjid)] == unseen) {
                    val[(pt->DualAdjid)] = ++dualid;
                    anintnode = new intnode;
                    anintnode->Adjid = pt->DualAdjid;
                    anintnode->next = CLASStermList[(theclass - adjcstart + 1)];
                    CLASStermList[(theclass - adjcstart + 1)] = anintnode;
                }
            }   // next we visit this node's list
            if (term[(pt->DualAdjid)] == unseen)
                FindTerminals(pt->DualAdjid);
        }
    }
};  // end member function FindTerminals
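
A compact way to see what FindTerminals computes is the sketch below, which collects every terminal node reachable from a starting node and reports each one once. It is offered under stated assumptions rather than as the dissertation's implementation: the function names, the vector-of-vectors adjacency structure, and the single `seen` array (the listing uses separate `val` and `term` arrays) are all choices made for this example.

```cpp
#include <cassert>
#include <vector>

// Recursive helper: follow every edge out of node k, record terminal
// neighbours the first time they are reached, and keep descending.
void collectTerminals(const std::vector<std::vector<int>>& adj,
                      const std::vector<bool>& isTerminal,
                      std::vector<bool>& seen,
                      int k,
                      std::vector<int>& out) {
    for (int next : adj[k]) {
        if (seen[next]) continue;       // already handled on another path
        seen[next] = true;
        if (isTerminal[next]) out.push_back(next);
        collectTerminals(adj, isTerminal, seen, next, out);
    }
}

// Return the ids of all terminal nodes reachable from `start`.
std::vector<int> findTerminals(const std::vector<std::vector<int>>& adj,
                               const std::vector<bool>& isTerminal,
                               int start) {
    std::vector<bool> seen(adj.size(), false);
    seen[start] = true;
    std::vector<int> out;
    collectTerminals(adj, isTerminal, seen, start, out);
    return out;
}
```

Marking a node as seen before descending into it keeps the recursion finite even when the graph contains cycles or several paths to the same leaf.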



BIOGRAPHICAL  SKETCH 

Raymond  L.  Major  was  born  in  Lanes,  South  Carolina,  on  the  18th  of  July  1955. 
Shortly  after  his  first  birthday,  the  Major  family  moved  to  Jacksonville,  Florida,  where 
he  attended  public  schools  and  graduated  from  Robert  E.  Lee  Senior  High  School.  His 
Bachelor  of  Science  in  Electrical  Engineering  and  Master  of  Business  Administration 
degrees  were  both  received  from  the  University  of  Florida. 

His work experience has varied; however, his professional experience has centered
on his favorite subject, electronics. He always gave praise to God in Heaven for not
letting the racial attitudes of his time disrupt many of his childhood dreams.




I  certify  that  I  have  read  this  study  and  that  in  my  opinion  it  conforms  to 
acceptable  standards  of  scholarly  presentation  and  is  fully  adequate,  in  scope  and  quality, 
as  a  dissertation  for  the  degree  of  Doctor  of  Philosophy. 


Gary Koehler, Chairperson
Professor  of  Decision  and 
Information  Sciences 

I  certify  that  I  have  read  this  study  and  that  in  my  opinion  it  conforms  to 
acceptable  standards  of  scholarly  presentation  and  is  fully  adequate,  in  scope  and  quality, 
as  a  dissertation  for  the  degree  of  Doctor  of  Philosophy. 


Selcuk  Erenguc 
Professor of Decision and
Information  Sciences 

I  certify  that  I  have  read  this  study  and  that  in  my  opinion  it  conforms  to 
acceptable  standards  of  scholarly  presentation  and  is  fully  adequate,  in  scope  and  quality, 
as  a  dissertation  for  the  degree  of  Doctor  of  Philosophy. 


Selwyn Piramuthu
Assistant  Professor  of  Decision 
and  Information  Sciences 

I  certify  that  I  have  read  this  study  and  that  in  my  opinion  it  conforms  to 
acceptable  standards  of  scholarly  presentation  and  is  fully  adequate,  in  scope  and  quality, 
as  a  dissertation  for  the  degree  of  Doctor  of  Philosophy. 


Patrick Thompson
Assistant Professor of Decision
and  Information  Sciences 


I certify that I have read this study and that in my opinion it conforms to
acceptable  standards  of  scholarly  presentation  and  is  fully  adequate,  in  scope  and  quality, 
as  a  dissertation  for  the  degree  of  Doctor  of  Philosophy. 

David  Denslow,  External  Member 
Distinguished  Service  Professor 
of  Economics 

This  dissertation  was  submitted  to  the  Graduate  Faculty  of  the  Department  of 
Decision  and  Information  Sciences  in  the  College  of  Business  Administration  and  to  the 
Graduate  School  and  was  accepted  as  partial  fulfillment  of  the  requirements  for  the  degree 
of  Doctor  of  Philosophy. 

August    1994  


Dean,  Graduate  School 




